Meng Zhang, Yang Liu, Huanbo Luan, Maosong Sun

Author: Frederica Doyle

1 Earth Mover’s Distance Minimization for Unsupervised Bilingual Lexicon Induction Meng Zhang, Yang Liu, Huanbo Luan, Maosong Sun

2 Outline Introduction Approach Experiments Conclusion

3 Task Description Chinese text + English text → bilingual lexicon, e.g. 文化 culture, 历史 history, 音乐 music, 学校 school, 国家 country, 时间 time. The task is called bilingual lexicon induction. A bilingual lexicon is basically a bilingual dictionary; the example here is a Chinese-English dictionary. To induce a bilingual lexicon is to obtain such a dictionary from corpora of the two languages. If the corpora are parallel, which means the texts are translations of each other, then this task is considered solved by word alignment. So we usually focus on non-parallel corpora. In this case, previous approaches typically require a seed lexicon to help connect the two languages. In the next slide let’s take a closer look at one of these approaches.

4 Previous Approach Reliant on the seed lexicon. Possible to remove the reliance? → Unsupervised. This is a pioneering work by Mikolov et al. As illustrated in this figure for English and Spanish, they observe that word embeddings trained separately on monolingual corpora exhibit isomorphic structure across languages. This interesting finding means a linear transformation may be established to connect word embedding spaces, allowing word features to transfer cross-lingually. However, they rely on a seed lexicon to train the linear transformation. In this work, we are interested in the question: is it possible to completely remove the reliance on the seed lexicon, so that we can achieve unsupervised bilingual lexicon induction? The answer is yes, and next let’s see how to do it. (Mikolov et al., 2013)

5 Outline Introduction Approach Experiments Conclusion

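6 Formulation Connecting two word embedding spaces without supervision essentially requires a measure over the embedding spaces as a whole, in other words a measure of the distance between the word embedding distributions.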

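7 Earth Mover’s Distance (EMD) Weighted piles of earth and holes. Minimize the overall moving cost. The moving plan serves as word translation. Words with multiple translations are handled automatically.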
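The piles-and-holes picture can be made concrete with a small transportation problem. The following is a minimal sketch, not the authors' implementation: it solves the EMD linear program with `scipy.optimize.linprog` on toy 2-D points. The words `A`, `B`, `a1`, `a2`, `b`, their coordinates, and their weights are hypothetical and chosen only so that one source word splits its weight between two target words.

```python
import numpy as np
from scipy.optimize import linprog


def emd_transport(src_weights, tgt_weights, cost):
    """Solve the transportation linear program; return (plan, total_cost)."""
    n, m = cost.shape
    # Row sums of the plan must equal the source weights (every pile is fully moved).
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    # Column sums must equal the target weights (every hole is exactly filled).
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    result = linprog(
        c=cost.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([src_weights, tgt_weights]),
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.x.reshape(n, m), result.fun


# Hypothetical toy data: source word A lies between target words a1 and a2.
src_points = np.array([[0.0, 0.0], [10.0, 0.0]])              # A, B
tgt_points = np.array([[0.0, 1.0], [0.0, -1.0], [10.0, 0.0]])  # a1, a2, b
src_weights = np.array([0.5, 0.5])
tgt_weights = np.array([0.25, 0.25, 0.5])

# Ground distance: Euclidean distance between every source/target pair.
cost = np.linalg.norm(src_points[:, None, :] - tgt_points[None, :, :], axis=2)
plan, total = emd_transport(src_weights, tgt_weights, cost)
print(plan)
print(total)
```

Nonzero entries in a row of `plan` can be read as candidate translations of that source word, which is how a single word ends up with more than one translation.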

8 Earth Mover’s Distance (EMD) A distance measure between discrete distributions, defined in terms of a ground distance and a transport polytope.
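The formula on this slide did not survive in the transcript. As a reference, the textbook definition of the EMD in terms of a ground distance and a transport polytope is sketched below. The symbols are assumptions and may differ from the authors' notation: $r$ and $c$ are the weight vectors of the two discrete distributions, $d_{ij}$ is the ground distance between the $i$-th source point and the $j$-th target point, and $\mathcal{U}(r, c)$ is the transport polytope.

```latex
\mathrm{EMD}(P_1, P_2) = \min_{T \in \mathcal{U}(r, c)} \sum_{i} \sum_{j} T_{ij}\, d_{ij},
\qquad
\mathcal{U}(r, c) = \left\{ T \ge 0 \;:\; \sum_{j} T_{ij} = r_i,\ \sum_{i} T_{ij} = c_j \right\}
```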

9 Wasserstein Distance A generalization that allows continuous distributions, defined over the set of all joint distributions with marginals P1 and P2 on the first and second factors respectively. For our task, the EMD and the Wasserstein distance are equivalent.
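The formula itself is missing from the transcript. The usual definition that matches this description is sketched below. The notation is an assumption: $\Gamma(P_1, P_2)$ denotes the set of joint distributions with marginals $P_1$ and $P_2$, and $d$ is the ground distance.

```latex
W(P_1, P_2) = \inf_{\gamma \in \Gamma(P_1, P_2)} \mathbb{E}_{(x, y) \sim \gamma} \left[ d(x, y) \right]
```

When both distributions are discrete, a joint distribution $\gamma$ is a transport matrix and the infimum becomes a minimum over the transport polytope, which is consistent with the equivalence stated on the slide.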

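10 Formulation To recap the formulation: the goal is to find a mapping G such that the EMD, or equivalently the Wasserstein distance, between the distribution of the transformed source-language word embeddings and the distribution of the target-language word embeddings is minimized.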

11 Wasserstein GAN (WGAN)
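This slide carries only a title in the transcript. For orientation, the standard WGAN formulation replaces the Wasserstein distance by its Kantorovich-Rubinstein dual, which holds when the ground distance is a metric. The sketch below applies it to the mapping $G$ from the formulation; the symbols $P^{S}$, $P^{T}$ (source and target embedding distributions) and $f$ (the critic) are assumptions, not taken from the slides.

```latex
\min_{G} \; \max_{\|f\|_{L} \le 1} \;
\mathbb{E}_{y \sim P^{T}} \left[ f(y) \right] - \mathbb{E}_{x \sim P^{S}} \left[ f(Gx) \right]
```

Here $\|f\|_{L} \le 1$ restricts the critic to 1-Lipschitz functions; in practice the critic is a neural network and the constraint is only enforced approximately.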

12 EMD Minimization Under Orthogonal Transformation (EMDOT) An orthogonal transformation is both empirically supported and theoretically appealing.
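The slide does not spell out the optimization procedure. One plausible reading, offered here as an assumption rather than the authors' algorithm, is an alternating scheme: fix the transformation and solve for the transport plan, then fix the plan and solve an orthogonal Procrustes problem by SVD. The sketch below simplifies further by assuming equal vocabulary sizes, uniform word weights, and a squared Euclidean ground distance, in which case the transport step reduces to a one-to-one assignment. The function name `emdot_sketch` and its parameters are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def emdot_sketch(X, Y, R0, n_iters=20):
    """Alternate between a transport step and an orthogonal Procrustes step.

    X, Y: (n, d) arrays of source and target embeddings (rows are words).
    R0:   (d, d) orthogonal initial transformation, applied as X @ R.
    Returns the final R, the final matching, and the objective after each iteration.
    """
    R = R0.copy()
    history = []
    for _ in range(n_iters):
        # Transport step: with R fixed, find the minimum-cost one-to-one matching
        # under squared Euclidean distance (a simplification of the general EMD).
        mapped = X @ R
        cost = ((mapped[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        rows, cols = linear_sum_assignment(cost)
        # Procrustes step: with the matching fixed, the orthogonal R minimizing
        # ||X R - Y_matched||_F is U @ Vt, where U S Vt is the SVD of X^T Y_matched.
        U, _, Vt = np.linalg.svd(X[rows].T @ Y[cols])
        R = U @ Vt
        history.append(float(((X[rows] @ R - Y[cols]) ** 2).sum()))
    return R, cols, history
```

Each step minimizes the same objective over one block of variables, so the recorded objective never increases. The result still depends on `R0`: starting from an arbitrary orthogonal matrix can leave the procedure in a poor local minimum.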

13 Discussion EMDOT is attractive for several reasons: it is compatible with the orthogonal constraint, makes few assumptions and approximations, is guaranteed to converge, has few hyperparameters, and is fast. But it converges to (often poor) local minima. WGAN is better at landing in a neighborhood of a good minimum.

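14 WGAN Training Trajectory Although the curve stays within a fairly reasonable range after an abrupt change, it still fluctuates noticeably, so the results are not very stable. This is exactly where EMDOT excels: it decreases the value of the objective function monotonically, the performance keeps improving, and it eventually converges. Therefore, using WGAN to find a set of parameters to initialize EMDOT combines the strengths of both methods and ultimately achieves good results.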

15 Outline Introduction Approach Experiments Conclusion

16 Bilingual Lexicon Induction Setup Five language pairs: Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. Reference methods: translation matrix (TM) (Mikolov et al., 2013) and isometric alignment (IA) (Zhang et al., 2016). TM, IA, and WGAN are postprocessed with EMD to handle multiple alternative translations. Performance measure: F1 score. TM and IA need seeds to train. Therefore, they are not directly comparable to our approach, and their performances are shown as reference. TM is the approach by Mikolov et al. we saw earlier in the introduction, and IA is an extension of TM that adds the orthogonal constraint.
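The slides do not state how the F1 score is computed. A common pair-level definition, given here as an assumption, treats the induced lexicon and the gold lexicon as sets of (source word, target word) pairs, so a word with several translations contributes several pairs. The function name `lexicon_f1` and the example entries with "nation" and "state" are hypothetical.

```python
def lexicon_f1(predicted, gold):
    """F1 over (source_word, target_word) pairs; both arguments are sets of pairs."""
    if not predicted or not gold:
        return 0.0
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 国家 has two gold translations, one of which is found.
gold = {("文化", "culture"), ("国家", "country"), ("国家", "nation")}
predicted = {("文化", "culture"), ("国家", "country"), ("国家", "state")}
score = lexicon_f1(predicted, gold)
print(score)
```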

17 Bilingual Lexicon Induction Results WGAN successfully finds reasonable transformations. Starting from these initializations, EMDOT improves on them considerably. With few seeds, the performance of TM and IA suffers, especially that of TM, which indicates the importance of the orthogonal constraint when seeds are few.

18 Language Distance Setup The EMD between embeddings of two languages can serve as a proxy for language distance. Two factors influence language distance: genealogy (typological dissimilarity) and language contact (geographical distance). The two factors are correlated. We show the EMD also correlates with them.

19 Language Distance Results Spanish, Italian, English: close both genealogically and geographically. English, Chinese, Japanese: different language families, with intensive language contact between Chinese and Japanese. Turkish, English: distant both genealogically and geographically.

20 Outline Introduction Approach Experiments Conclusion

21 Conclusion Unsupervised bilingual lexicon induction: non-parallel data → bilingual lexicon. Formulation: find a transformation that minimizes the distance between embedding distributions. Distribution-level minimization removes the need for word-level cross-lingual supervision. The earth mover’s distance: a natural distance for this task, and a quantification of language distance. Bilingual lexicon induction seems to be an intrinsically cross-lingual task, and appears formidable with only monolingual data, but we make it possible by formulating the task to find…

22 Thanks