Expanding NLP models to new languages typically involves annotating new datasets, which is time- and resource-intensive. To reduce the costs, one can use cross-lingual embeddings, enabling knowledge transfer from languages with sufficient training data to low-resource languages. In this talk, you will hear about the challenges in learning cross-lingual embeddings for multilingual resume parsing.
1. How to expand your NLP solution to new languages using transfer learning
Lena Shakurova
shakurova@textkernel.nl
Beata Nyari, Chao Li, Mihai Rotaru
2019-05-12
2. What this talk is about
You have an NLP solution for several languages
You want to support more languages
No training data, a lot of raw data
How to expand your solution to new languages using transfer learning?
12. Issue vs. proposed solution
Issue:
• 20 languages
• Separate models for separate languages
• New languages (100+) lack labeled data
Proposed solution: multilingual model
• Implement models for new languages as fast as possible
• Improve performance on low-resource languages (using transfer learning and cross-lingual embeddings)
15. Pre-trained embeddings
MUSE:
• 30 languages in a shared space
• already gives good results
Open-source alignment code:
• Bilingual: VecMap
• Multilingual: Multilingual fastText, UMWE, CCA (github.com/gallantlab/pyrcca)
In our research we used CCA.
16. Canonical correlation analysis (CCA)
• Train monolingual word embeddings
• Learn the transformation matrices using a bilingual dictionary
• Map the monolingual spaces into one shared semantic space such that translation pairs are maximally correlated
Faruqui, M., & Dyer, C. (2014)
[Diagram: the English space Σ and the German space Ω are mapped, via transformation matrices V and W learned from the bilingual dictionary, into the transformed spaces Σ* and Ω*, which lie in a shared space.]
17. Canonical correlation analysis (CCA)
• Ω* and Σ* lie in the same space
• Ω* can be projected into the English embedding space Σ using the inverse of V: Ω** = V⁻¹ · Ω*
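A minimal sketch of this alignment and back-projection, in the spirit of Faruqui & Dyer (2014). It uses scikit-learn's CCA rather than the pyrcca library named on the tools slide, and the random matrices stand in for real embedding vectors of dictionary translation pairs; dimensions and the pseudo-inverse step are illustrative assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
Sigma = rng.normal(size=(1000, 50))   # English vectors for dictionary entries
Omega = rng.normal(size=(1000, 50))   # German vectors for their translations

cca = CCA(n_components=20)
cca.fit(Sigma, Omega)                 # learns V (English) and W (German)

# Map both spaces into the shared space: Sigma*, Omega*.
Sigma_star, Omega_star = cca.transform(Sigma, Omega)

# Slide 17's back-projection Omega** = V^-1 * Omega*. V is rectangular,
# so a pseudo-inverse is used here (centering and scaling are ignored,
# making this an approximation).
V = cca.x_rotations_                  # shape (50, 20)
Omega_double_star = Omega_star @ np.linalg.pinv(V)
```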
19. Zero-shot parsing vs. joint training
[Diagram of the two setups:]
• Zero-shot parsing: train on English train data with English embeddings; at test time, parse German CVs using projected German embeddings.
• Joint training: train on English train data (English embeddings) plus German train data (projected German embeddings); at test time, parse German CVs using projected German embeddings.
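As a toy illustration of the zero-shot idea (not the authors' actual sequence tagger): because the model only ever sees word vectors, a classifier trained on English vectors can be applied unchanged to German tokens once their embeddings live in the shared space. All words, vectors, and labels below are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy aligned vectors; in practice these come from the CCA projection.
en_vectors = {"engineer": np.array([1.0, 0.0]), "the": np.array([0.0, 1.0])}
de_projected = {"ingenieur": np.array([0.9, 0.1]), "der": np.array([0.1, 0.9])}

def featurize(tokens, vectors, dim=2):
    # Look up each token's embedding; unknown words get zero vectors.
    return np.array([vectors.get(t, np.zeros(dim)) for t in tokens])

# Training uses English data only (label 1 = job title, 0 = other).
tagger = LogisticRegression().fit(
    featurize(["engineer", "the"], en_vectors), [1, 0])

# Zero-shot testing: the same model runs on projected German vectors,
# with no German labels used anywhere.
print(tagger.predict(featurize(["der", "ingenieur"], de_projected)))
# expected: [0 1], if the projection is good
```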
21. Experimental setup
Task:
• Parse German CVs
• Extract job title and organisation
Embeddings:
• Trained on domain data
• word2vec
• CCA
Research questions:
• Does transfer learning work for us?
• How does the bilingual dictionary influence downstream performance?
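For the embedding step, a minimal sketch of training monolingual domain embeddings with gensim's word2vec (gensim ≥ 4); the toy corpus and hyperparameters are assumptions, not the authors' settings, and the same training is done separately per language before CCA alignment:

```python
from gensim.models import Word2Vec

# Tokenized domain (CV) sentences; real training uses a large raw corpus.
sentences = [["senior", "software", "engineer", "at", "acme"],
             ["project", "manager", "experience", "in", "logistics"]]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)
vec = model.wv["engineer"]   # a 300-dimensional domain embedding
```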
23. Does transfer learning work?
[Chart: zero-shot parsing scores 75.8; cross-lingual gains of +4.1 and then +0.2 as German training data grows.]
Monolingual:
• More German data -> better performance
Cross-lingual:
• Zero-shot parsing works
• There is a gain from transfer learning
• The more data we have, the smaller the cross-lingual gain
24. How to construct a bilingual dictionary?
1. Use existing bilingual dictionaries:
○ Internet Dictionary Project (IDP)
○ MUSE
■ 110 bilingual dictionaries
■ Created for the development and evaluation of cross-lingual word embeddings
2. Construct your own:
• Use domain data
[Pipeline: choose English words, filtering by frequency and size, then translate into German with Google Translate or Yandex Translate; see the sketch below.]
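A hedged sketch of that construction pipeline. Here `translate` is a hypothetical callable standing in for a Google/Yandex Translate API call, and the frequency, size, and filtering choices mirror the slide rather than the authors' exact code:

```python
from collections import Counter

def build_dictionary(domain_tokens, translate, size=5000):
    """Build an English->German dictionary from frequent domain words.

    `translate` is a hypothetical wrapper around a translation API
    (e.g. Google or Yandex Translate); assume it returns None on failure.
    """
    counts = Counter(t.lower() for t in domain_tokens if t.isalpha())
    dictionary = {}
    for word, _ in counts.most_common():               # frequency ordering
        translation = translate(word, src="en", tgt="de")
        if translation and translation.lower() != word:  # simple filtering
            dictionary[word] = translation
        if len(dictionary) >= size:                    # size cutoff (5k-10k)
            break
    return dictionary
```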
25. Source of data: IDP vs. MUSE vs. CV
Using a bilingual dictionary built from domain (CV) data boosts performance:
• IDP: zero-shot parsing 61.5, joint training 79.5
• MUSE: zero-shot parsing 72.1, joint training 80.4
• CV vocabulary: zero-shot parsing 75.8, joint training 81.1
26. Frequency: top vs. less frequent words
Using a bilingual dictionary of top-frequency words boosts performance:
• Less frequent words: zero-shot parsing 65.6, joint training 80.1
• Top frequent words: zero-shot parsing 75.7, joint training 80.7
27. Size of bilingual dictionary: 1k vs. 5k vs. 10k
A bigger bilingual dictionary boosts performance:
• 1k: zero-shot parsing 70.4, joint training 80.0
• 5k: zero-shot parsing 76.3, joint training 81.1
• 10k: zero-shot parsing 76.6, joint training 81.4
28. Bilingual dictionary: what did we learn?
Best practices for constructing a bilingual dictionary:
1. Use domain words
2. Use frequent words
3. Use a size of 5k or 10k
The less training data you have available, the more attention you need to pay to the bilingual dictionary.
34. Dutch to English
• Zero-shot parsing works
• There is a gain from transfer learning
• The more data we have, the smaller the cross-lingual gain
[Chart: Dutch zero-shot parsing scores 79.1, with cross-lingual gains of +6.3 and then +1.3 as more data is added.]
37. Summary
• Transfer learning works
• Pretty good results on zero-shot parsing
• The cross-lingual gain shrinks as we add more data from the target language
• The quality of the bilingual dictionary affects end-task performance
• The less training data you have available, the more attention you need to pay to the bilingual dictionary
• Use the top 5k most frequent words from your domain corpora