Bilingual term extraction with Big Data (and Gensim)

•Transferir como PPTX, PDF•

1 gostou•255 visualizações

Remy Blaettler

Short description on your approach to extract terminology data from translation memories with gensim.

Dados e análise

Bilingual Term Extraction
with Big Data.
Research project with the Zurich University of Applied Sciences (Switzerland).
Conducted by Mark Cieliebak.

Rémy Blättler
Chief of the System
Developer.
Co-Founder.
Chief of the System.

In Zurich, Berlin and Los Angeles.
In English, Spanish, Portuguese, German,
Chinese, Japanese and 80 other languages.

• Simple word frequencies
• Using dictionaries to find “phrases”
Naïve approaches

• Shallow, two-layer neural networks trained to
reconstruct linguistic contexts of words
• Free & open-source
GENSIM word2vec

breakfast cereal dinner lunch => Cereal
GENSIM word2vec
man => boy woman => girl
Sweden
 Norway 0.76
 Finland 0.71
 Estonia 0.54

New York City
Terms and Conditions
GENSIM Phrases
Never Follow (Audi)
Just do it (Nike)

1. Average occurrence of a term over all corpora
2. Average occurrence of a term for one client
3. Same for the other language
Term extraction 1

Detect the specific phrase in the source & target:
Geänderte Segmentberichterstattung erhöht
Aussagekraft.
Improvements gained from changes
in segment reporting.
Term extraction 2

Plugin additional algorithms
Works well for small fixes too
GENSIM
Upper-
case
Weighted output
Entity
extraction
Data corpus
Remove
cities

German English
myCloud myCloud
Wie kann ich How can I
Swisscom Broadcast Swisscom Broadcast
HomepageTool HomepageTool
Abo subscription
Swisscom (Schweiz) AG Swisscom (Switzerland) Ltd
Swisscom Sharespace Swisscom Sharespace
App app
Healthi Healthi
KMU SME
SharePoint SharePoint
Endresultat Swisscom.”
Ihre Webseite your website
KMU SMEs
Dateien Files
Swisscom myCloud Swisscom myCloud
generelles Rauchverbot cablex."
Dateien Photos
inOne mobile inOne mobile
Business Apps Business Apps

The big picture: fast & easy client onboarding
Webspider
(with PhantomJS)
(N)MT-based
alignment
Term extraction
Structured text in
multiple languages
Translation
memory
Terminology
database

• Speed (tests take multiple hours)
• Insufficient data (>50k TM units helps)
• Bad source data (HTML, Javascript, etc.)
Problems

$1,000 KTI Research Project
$20,000 SpinningBytes & University Cooperation
=> $20,000 more until online & 80h+ internal
Budget

Questions?
@remyblaettler (Twitter)
remy@supertext.ch

Bilingual term extraction with Big Data (and Gensim)

Mais conteúdo relacionado

Semelhante a Bilingual term extraction with Big Data (and Gensim)

Using Bioinformatics Data to inform Therapeutics discovery and developmentEleanor Howe

NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...Maryam Farooq

Six Easy Pieces of Quantitatively Analyzing Open SourceDirk Riehle

NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...Maryam Farooq

Knowledge-Driven NGS SolutionsDrDavidLangenberger

AI-Powered Linguistics and Search with Fusion and RosetteLucidworks

Agile Development Practices - ProductivityAlex Moore

Shift Left, Shift Right and improve the centreAugusto Evangelisti

REVIEW PPT.pptxSaravanaD2

building intelligent systems with large scale deep learningmustafa sarac

VOC real world enterprise needsIvan Berlocher

Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri

Making Sense of SeleniumSmartBear

Fake news detection shalushamil

Learn Real World Machine Learning By Building ProjectsJohn Alex

tem7guest69032c

"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...Edge AI and Vision Alliance

Whole test suite generationJPINFOTECH JAYAPRAKASH

Open vocabulary problemJaeHo Jang

DevSecOps without DevOps is Just SecurityKevin Fealey

Semelhante a Bilingual term extraction with Big Data (and Gensim) (20)

Using Bioinformatics Data to inform Therapeutics discovery and development

NLP Community Conference - Dr. Catherine Havasi (ConceptNet/MIT Media Lab/Lum...

Six Easy Pieces of Quantitatively Analyzing Open Source

NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...

Knowledge-Driven NGS Solutions

AI-Powered Linguistics and Search with Fusion and Rosette

Agile Development Practices - Productivity

Shift Left, Shift Right and improve the centre

REVIEW PPT.pptx

building intelligent systems with large scale deep learning

VOC real world enterprise needs

Natural Language Processing, Techniques, Current Trends and Applications in I...

Making Sense of Selenium

Fake news detection

Learn Real World Machine Learning By Building Projects

tem7

"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...

Whole test suite generation

Open vocabulary problem

DevSecOps without DevOps is Just Security

Mais de Remy Blaettler

The Lean Startup (Theory & Motivation - Examples & Learnings)Remy Blaettler

Emotional Webdesign am Internet Briefing 2013 in BernRemy Blaettler

Multilingual Websites - UX Camp Europe 2013Remy Blaettler

Multilingual websites - UX Camp Berlin 2012Remy Blaettler

Online CAT and project management tools for translatorsRemy Blaettler

Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011Remy Blaettler

Make the user smile - Emotional Webdesign - UX Camp Berlin 2010Remy Blaettler

Mais de Remy Blaettler (7)

The Lean Startup (Theory & Motivation - Examples & Learnings)

Emotional Webdesign am Internet Briefing 2013 in Bern

Multilingual Websites - UX Camp Europe 2013

Multilingual websites - UX Camp Berlin 2012

Online CAT and project management tools for translators

Emotional Webdesign - How to make the user smile - Frontend Conf Zürich 2011

Make the user smile - Emotional Webdesign - UX Camp Berlin 2010

Último

CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083

Halmar dropshipping via API with DroFxolyaivanovalion

April 2024 - Crypto Market Report's Analysismanisha194592

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823

Smarteg dropshipping via API with DroFx.pptxolyaivanovalion

Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823

Week-01-2.ppt BBB human Computer interactionfulawalesam

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service9953056974 Low Rate Call Girls In Saket, Delhi NCR

Mature dropshipping via API with DroFx.pptxolyaivanovalion

Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083

Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh9953056974 Low Rate Call Girls In Saket, Delhi NCR

Sampling (random) method and Non random.pptDr. Soumendra Kumar Patra

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums

BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692

Carero dropshipping via API with DroFx.pptxolyaivanovalion

Introduction-to-Machine-Learning (1).pptxfirstjob4

Bilingual term extraction with Big Data (and Gensim)

1. Bilingual Term Extraction with Big Data. Research project with the Zurich University of Applied Sciences (Switzerland). Conducted by Mark Cieliebak.

2. Rémy Blättler Chief of the System Developer. Co-Founder. Chief of the System.

3. In Zurich, Berlin and Los Angeles. In English, Spanish, Portuguese, German, Chinese, Japanese and 80 other languages.

4. • Simple word frequencies • Using dictionaries to find “phrases” Naïve approaches

5. • Shallow, two-layer neural networks trained to reconstruct linguistic contexts of words • Free & open-source GENSIM word2vec

6. breakfast cereal dinner lunch => Cereal GENSIM word2vec man => boy woman => girl Sweden  Norway 0.76  Finland 0.71  Estonia 0.54

7. New York City Terms and Conditions GENSIM Phrases Never Follow (Audi) Just do it (Nike)

8. 1. Average occurrence of a term over all corpora 2. Average occurrence of a term for one client 3. Same for the other language Term extraction 1

9. Detect the specific phrase in the source & target: Geänderte Segmentberichterstattung erhöht Aussagekraft. Improvements gained from changes in segment reporting. Term extraction 2

10. Plugin additional algorithms Works well for small fixes too GENSIM Upper- case Weighted output Entity extraction Data corpus Remove cities

11. German English myCloud myCloud Wie kann ich How can I Swisscom Broadcast Swisscom Broadcast HomepageTool HomepageTool Abo subscription Swisscom (Schweiz) AG Swisscom (Switzerland) Ltd Swisscom Sharespace Swisscom Sharespace App app Healthi Healthi KMU SME SharePoint SharePoint Endresultat Swisscom.” Ihre Webseite your website KMU SMEs Dateien Files Swisscom myCloud Swisscom myCloud generelles Rauchverbot cablex." Dateien Photos inOne mobile inOne mobile Business Apps Business Apps

12. The big picture: fast & easy client onboarding Webspider (with PhantomJS) (N)MT-based alignment Term extraction Structured text in multiple languages Translation memory Terminology database

13.

14. • Speed (tests take multiple hours) • Insufficient data (>50k TM units helps) • Bad source data (HTML, Javascript, etc.) Problems

15. $1,000 KTI Research Project $20,000 SpinningBytes & University Cooperation => $20,000 more until online & 80h+ internal Budget

16. Questions? @remyblaettler (Twitter) remy@supertext.ch

Notas do Editor

Cluster documents and classify them by topic Sentiment Analysis Recommendations, e.g. CRM
It gets complicated if a segment has multiple extracted phrases. => Find other segments with the same phrases
Next step: Replace Weighted output with Neural Network. Whats stopping us? Not enough Terminology data to train with. Maybe a public system would help.
Examples DON’T PROOFREAD

Bilingual term extraction with Big Data (and Gensim)

Recomendados

Recomendados

Mais conteúdo relacionado

Semelhante a Bilingual term extraction with Big Data (and Gensim)

Semelhante a Bilingual term extraction with Big Data (and Gensim) (20)

Mais de Remy Blaettler

Mais de Remy Blaettler (7)

Último

Último (20)

Bilingual term extraction with Big Data (and Gensim)

Notas do Editor