Bilingual term extraction with Big Data (and Gensim)
1. Bilingual Term Extraction
with Big Data.
Research project with the Zurich University of Applied Sciences (Switzerland).
Conducted by Mark Cieliebak.
3. In Zurich, Berlin and Los Angeles.
In English, Spanish, Portuguese, German,
Chinese, Japanese and 80 other languages.
4. • Simple word frequencies
• Using dictionaries to find “phrases”
Naïve approaches
5. • Shallow, two-layer neural networks trained to
reconstruct linguistic contexts of words
• Free & open-source
GENSIM word2vec
6. breakfast cereal dinner lunch => Cereal
GENSIM word2vec
man => boy woman => girl
Sweden
Norway 0.76
Finland 0.71
Estonia 0.54
7. New York City
Terms and Conditions
GENSIM Phrases
Never Follow (Audi)
Just do it (Nike)
8. 1. Average occurrence of a term over all corpora
2. Average occurrence of a term for one client
3. Same for the other language
Term extraction 1
9. Detect the specific phrase in the source & target:
Geänderte Segmentberichterstattung erhöht
Aussagekraft.
Improvements gained from changes
in segment reporting.
Term extraction 2
10. Plugin additional algorithms
Works well for small fixes too
GENSIM
Upper-
case
Weighted output
Entity
extraction
Data corpus
Remove
cities
11. German English
myCloud myCloud
Wie kann ich How can I
Swisscom Broadcast Swisscom Broadcast
HomepageTool HomepageTool
Abo subscription
Swisscom (Schweiz) AG Swisscom (Switzerland) Ltd
Swisscom Sharespace Swisscom Sharespace
App app
Healthi Healthi
KMU SME
SharePoint SharePoint
Endresultat Swisscom.”
Ihre Webseite your website
KMU SMEs
Dateien Files
Swisscom myCloud Swisscom myCloud
generelles Rauchverbot cablex."
Dateien Photos
inOne mobile inOne mobile
Business Apps Business Apps
12. The big picture: fast & easy client onboarding
Webspider
(with PhantomJS)
(N)MT-based
alignment
Term extraction
Structured text in
multiple languages
Translation
memory
Terminology
database
13.
14. • Speed (tests take multiple hours)
• Insufficient data (>50k TM units helps)
• Bad source data (HTML, Javascript, etc.)
Problems
15. $1,000 KTI Research Project
$20,000 SpinningBytes & University Cooperation
=> $20,000 more until online & 80h+ internal
Budget
Cluster documents and classify them by topic
Sentiment Analysis
Recommendations, e.g. CRM
It gets complicated if a segment has multiple extracted phrases.
=> Find other segments with the same phrases
Next step:
Replace Weighted output with Neural Network.
Whats stopping us? Not enough Terminology data to train with.
Maybe a public system would help.