SlideShare uma empresa Scribd logo
1 de 18
Bilingual Term Extraction
with Big Data.
Research project with the Zurich University of Applied Sciences (Switzerland).
Conducted by Mark Cieliebak.
Rémy Blättler
Chief of the System
Developer.
Co-Founder.
Chief of the System.
In Zurich, Berlin and Los Angeles.
In English, Spanish, Portuguese, German,
Chinese, Japanese and 80 other languages.
• Simple word frequencies
• Using dictionaries to find “phrases”
Naïve approaches
• Shallow, two-layer neural networks trained to
reconstruct linguistic contexts of words
• Free & open-source
GENSIM word2vec
breakfast cereal dinner lunch => Cereal
GENSIM word2vec
man => boy woman => girl
Sweden
 Norway 0.76
 Finland 0.71
 Estonia 0.54
New York City
Terms and Conditions
GENSIM Phrases
Never Follow (Audi)
Just do it (Nike)
1. Average occurrence of a term over all corpora
2. Average occurrence of a term for one client
3. Same for the other language
Term extraction 1
Detect the specific phrase in the source & target:
Geänderte Segmentberichterstattung erhöht
Aussagekraft.
Improvements gained from changes
in segment reporting.
Term extraction 2
Plugin additional algorithms
Works well for small fixes too
GENSIM
Upper-
case
Weighted output
Entity
extraction
Data corpus
Remove
cities
German English
myCloud myCloud
Wie kann ich How can I
Swisscom Broadcast Swisscom Broadcast
HomepageTool HomepageTool
Abo subscription
Swisscom (Schweiz) AG Swisscom (Switzerland) Ltd
Swisscom Sharespace Swisscom Sharespace
App app
Healthi Healthi
KMU SME
SharePoint SharePoint
Endresultat Swisscom.”
Ihre Webseite your website
KMU SMEs
Dateien Files
Swisscom myCloud Swisscom myCloud
generelles Rauchverbot cablex."
Dateien Photos
inOne mobile inOne mobile
Business Apps Business Apps
The big picture: fast & easy client onboarding
Webspider
(with PhantomJS)
(N)MT-based
alignment
Term extraction
Structured text in
multiple languages
Translation
memory
Terminology
database
• Speed (tests take multiple hours)
• Insufficient data (>50k TM units helps)
• Bad source data (HTML, Javascript, etc.)
Problems
$1,000 KTI Research Project
$20,000 SpinningBytes & University Cooperation
=> $20,000 more until online & 80h+ internal
Budget
Questions?
@remyblaettler (Twitter)
remy@supertext.ch
Bilingual term extraction with Big Data (and Gensim)
Bilingual term extraction with Big Data (and Gensim)

Mais conteúdo relacionado

Semelhante a Bilingual term extraction with Big Data (and Gensim)

Using Bioinformatics Data to inform Therapeutics discovery and development
Using Bioinformatics Data to inform Therapeutics discovery and developmentUsing Bioinformatics Data to inform Therapeutics discovery and development
Using Bioinformatics Data to inform Therapeutics discovery and developmentEleanor Howe
 
NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...
NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...
NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...Maryam Farooq
 
Six Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces of Quantitatively Analyzing Open SourceSix Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces of Quantitatively Analyzing Open SourceDirk Riehle
 
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...Maryam Farooq
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
 
Agile Development Practices - Productivity
Agile Development Practices - ProductivityAgile Development Practices - Productivity
Agile Development Practices - ProductivityAlex Moore
 
Shift Left, Shift Right and improve the centre
Shift Left, Shift Right and improve the centreShift Left, Shift Right and improve the centre
Shift Left, Shift Right and improve the centreAugusto Evangelisti
 
REVIEW PPT.pptx
REVIEW PPT.pptxREVIEW PPT.pptx
REVIEW PPT.pptxSaravanaD2
 
building intelligent systems with large scale deep learning
building intelligent systems with large scale deep learningbuilding intelligent systems with large scale deep learning
building intelligent systems with large scale deep learningmustafa sarac
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needsIvan Berlocher
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
Making Sense of Selenium
Making Sense of SeleniumMaking Sense of Selenium
Making Sense of SeleniumSmartBear
 
Fake news detection
Fake news detection Fake news detection
Fake news detection shalushamil
 
Learn Real World Machine Learning By Building Projects
Learn Real World Machine Learning By Building ProjectsLearn Real World Machine Learning By Building Projects
Learn Real World Machine Learning By Building ProjectsJohn Alex
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...Edge AI and Vision Alliance
 
Open vocabulary problem
Open vocabulary problemOpen vocabulary problem
Open vocabulary problemJaeHo Jang
 
DevSecOps without DevOps is Just Security
DevSecOps without DevOps is Just SecurityDevSecOps without DevOps is Just Security
DevSecOps without DevOps is Just SecurityKevin Fealey
 

Semelhante a Bilingual term extraction with Big Data (and Gensim) (20)

Using Bioinformatics Data to inform Therapeutics discovery and development
Using Bioinformatics Data to inform Therapeutics discovery and developmentUsing Bioinformatics Data to inform Therapeutics discovery and development
Using Bioinformatics Data to inform Therapeutics discovery and development
 
NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...
NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...
NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...
 
Six Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces of Quantitatively Analyzing Open SourceSix Easy Pieces of Quantitatively Analyzing Open Source
Six Easy Pieces of Quantitatively Analyzing Open Source
 
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
 
Knowledge-Driven NGS Solutions
Knowledge-Driven NGS SolutionsKnowledge-Driven NGS Solutions
Knowledge-Driven NGS Solutions
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
 
Agile Development Practices - Productivity
Agile Development Practices - ProductivityAgile Development Practices - Productivity
Agile Development Practices - Productivity
 
Shift Left, Shift Right and improve the centre
Shift Left, Shift Right and improve the centreShift Left, Shift Right and improve the centre
Shift Left, Shift Right and improve the centre
 
REVIEW PPT.pptx
REVIEW PPT.pptxREVIEW PPT.pptx
REVIEW PPT.pptx
 
building intelligent systems with large scale deep learning
building intelligent systems with large scale deep learningbuilding intelligent systems with large scale deep learning
building intelligent systems with large scale deep learning
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needs
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Making Sense of Selenium
Making Sense of SeleniumMaking Sense of Selenium
Making Sense of Selenium
 
Fake news detection
Fake news detection Fake news detection
Fake news detection
 
Learn Real World Machine Learning By Building Projects
Learn Real World Machine Learning By Building ProjectsLearn Real World Machine Learning By Building Projects
Learn Real World Machine Learning By Building Projects
 
tem7
tem7tem7
tem7
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
 
Whole test suite generation
Whole test suite generationWhole test suite generation
Whole test suite generation
 
Open vocabulary problem
Open vocabulary problemOpen vocabulary problem
Open vocabulary problem
 
DevSecOps without DevOps is Just Security
DevSecOps without DevOps is Just SecurityDevSecOps without DevOps is Just Security
DevSecOps without DevOps is Just Security
 

Mais de Remy Blaettler

The Lean Startup (Theory & Motivation - Examples & Learnings)
The Lean Startup (Theory & Motivation - Examples & Learnings)The Lean Startup (Theory & Motivation - Examples & Learnings)
The Lean Startup (Theory & Motivation - Examples & Learnings)Remy Blaettler
 
Emotional Webdesign am Internet Briefing 2013 in Bern
Emotional Webdesign am Internet Briefing 2013 in BernEmotional Webdesign am Internet Briefing 2013 in Bern
Emotional Webdesign am Internet Briefing 2013 in BernRemy Blaettler
 
Multilingual Websites - UX Camp Europe 2013
Multilingual Websites - UX Camp Europe 2013Multilingual Websites - UX Camp Europe 2013
Multilingual Websites - UX Camp Europe 2013Remy Blaettler
 
Multilingual websites - UX Camp Berlin 2012
Multilingual websites - UX Camp Berlin 2012Multilingual websites - UX Camp Berlin 2012
Multilingual websites - UX Camp Berlin 2012Remy Blaettler
 
Online CAT and project management tools for translators
Online CAT and project management tools for translatorsOnline CAT and project management tools for translators
Online CAT and project management tools for translatorsRemy Blaettler
 
Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011
Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011
Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011Remy Blaettler
 
Make the user smile - Emotional Webdesign - UX Camp Berlin 2010
Make the user smile - Emotional Webdesign - UX Camp Berlin 2010Make the user smile - Emotional Webdesign - UX Camp Berlin 2010
Make the user smile - Emotional Webdesign - UX Camp Berlin 2010Remy Blaettler
 

Mais de Remy Blaettler (7)

The Lean Startup (Theory & Motivation - Examples & Learnings)
The Lean Startup (Theory & Motivation - Examples & Learnings)The Lean Startup (Theory & Motivation - Examples & Learnings)
The Lean Startup (Theory & Motivation - Examples & Learnings)
 
Emotional Webdesign am Internet Briefing 2013 in Bern
Emotional Webdesign am Internet Briefing 2013 in BernEmotional Webdesign am Internet Briefing 2013 in Bern
Emotional Webdesign am Internet Briefing 2013 in Bern
 
Multilingual Websites - UX Camp Europe 2013
Multilingual Websites - UX Camp Europe 2013Multilingual Websites - UX Camp Europe 2013
Multilingual Websites - UX Camp Europe 2013
 
Multilingual websites - UX Camp Berlin 2012
Multilingual websites - UX Camp Berlin 2012Multilingual websites - UX Camp Berlin 2012
Multilingual websites - UX Camp Berlin 2012
 
Online CAT and project management tools for translators
Online CAT and project management tools for translatorsOnline CAT and project management tools for translators
Online CAT and project management tools for translators
 
Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011
Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011
Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011
 
Make the user smile - Emotional Webdesign - UX Camp Berlin 2010
Make the user smile - Emotional Webdesign - UX Camp Berlin 2010Make the user smile - Emotional Webdesign - UX Camp Berlin 2010
Make the user smile - Emotional Webdesign - UX Camp Berlin 2010
 

Último

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 

Último (20)

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 

Bilingual term extraction with Big Data (and Gensim)

  • 1. Bilingual Term Extraction with Big Data. Research project with the Zurich University of Applied Sciences (Switzerland). Conducted by Mark Cieliebak.
  • 2. Rémy Blättler Chief of the System Developer. Co-Founder. Chief of the System.
  • 3. In Zurich, Berlin and Los Angeles. In English, Spanish, Portuguese, German, Chinese, Japanese and 80 other languages.
  • 4. • Simple word frequencies • Using dictionaries to find “phrases” Naïve approaches
  • 5. • Shallow, two-layer neural networks trained to reconstruct linguistic contexts of words • Free & open-source GENSIM word2vec
  • 6. breakfast cereal dinner lunch => Cereal GENSIM word2vec man => boy woman => girl Sweden  Norway 0.76  Finland 0.71  Estonia 0.54
  • 7. New York City Terms and Conditions GENSIM Phrases Never Follow (Audi) Just do it (Nike)
  • 8. 1. Average occurrence of a term over all corpora 2. Average occurrence of a term for one client 3. Same for the other language Term extraction 1
  • 9. Detect the specific phrase in the source & target: Geänderte Segmentberichterstattung erhöht Aussagekraft. Improvements gained from changes in segment reporting. Term extraction 2
  • 10. Plugin additional algorithms Works well for small fixes too GENSIM Upper- case Weighted output Entity extraction Data corpus Remove cities
  • 11. German English myCloud myCloud Wie kann ich How can I Swisscom Broadcast Swisscom Broadcast HomepageTool HomepageTool Abo subscription Swisscom (Schweiz) AG Swisscom (Switzerland) Ltd Swisscom Sharespace Swisscom Sharespace App app Healthi Healthi KMU SME SharePoint SharePoint Endresultat Swisscom.” Ihre Webseite your website KMU SMEs Dateien Files Swisscom myCloud Swisscom myCloud generelles Rauchverbot cablex." Dateien Photos inOne mobile inOne mobile Business Apps Business Apps
  • 12. The big picture: fast & easy client onboarding Webspider (with PhantomJS) (N)MT-based alignment Term extraction Structured text in multiple languages Translation memory Terminology database
  • 13.
  • 14. • Speed (tests take multiple hours) • Insufficient data (>50k TM units helps) • Bad source data (HTML, Javascript, etc.) Problems
  • 15. $1,000 KTI Research Project $20,000 SpinningBytes & University Cooperation => $20,000 more until online & 80h+ internal Budget

Notas do Editor

  1. Cluster documents and classify them by topic Sentiment Analysis Recommendations, e.g. CRM
  2. It gets complicated if a segment has multiple extracted phrases. => Find other segments with the same phrases
  3. Next step: Replace Weighted output with Neural Network. Whats stopping us? Not enough Terminology data to train with. Maybe a public system would help.
  4. Examples DON’T PROOFREAD