This software learns a bilingual lexicon from non-parallel data with the help of a small seed lexicon. The technique is described in the following paper:
Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, and Maosong Sun. Bilingual Lexicon Induction From Non-Parallel Data With Minimal Supervision. In Proceedings of AAAI, 2017. [paper]
Proceed to the download page.
This software has been tested in the following environment, but should work in a compatible one.
1. Compile the code.
2. Specify the variables in the config file. For example, if config contains the following lines:
    config=zh-en
    lang1=zh
    lang2=en
then the data should be located in data/zh-en with file extensions
3. Prepare data according to Step 2. Toy non-parallel data is provided, along with a Chinese-English seed lexicon of 100 word translation pairs. If your seed lexicon has more than 10,000 entries, you need to modify the code by redefining
4. Train and obtain the bilingual lexicon.
5. The following files will be generated in data/zh-en (the folder specified in
<translation candidate>:<cosine similarity to the source word vector>, separated by space and sorted in decreasing order of the cosine similarity.
</s> is the sentence marker; its translations should be ignored.
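The folder convention from Step 2 and the output format from Step 5 can be read with a short script. The following Python is a minimal sketch and not part of the released software. It assumes that the config file holds plain key=value settings, and that each output line starts with the source word followed by its candidate entries; neither layout is confirmed by the text above. The function names and the sample line are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

SENTENCE_MARKER = "</s>"  # sentence marker named in the description above


def data_dir_from_config(config_text: str) -> str:
    """Return the data folder implied by the 'config' setting.

    Assumption: settings are whitespace-separated key=value tokens.
    """
    settings: Dict[str, str] = {}
    for token in config_text.split():
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed key=value tokens
            settings[key] = value
    return "data/" + settings["config"]


def parse_candidates(fields: List[str]) -> List[Tuple[str, float]]:
    """Parse '<translation candidate>:<cosine similarity>' entries."""
    pairs: List[Tuple[str, float]] = []
    for field in fields:
        # Split on the LAST colon so a candidate containing ':' stays intact.
        word, sep, score = field.rpartition(":")
        if sep:
            pairs.append((word, float(score)))
    return pairs


def parse_line(line: str) -> Optional[Tuple[str, List[Tuple[str, float]]]]:
    """Parse one output line; return None for blank or sentence-marker lines.

    Assumption: the first field is the source word and the remaining
    fields are its candidates, already sorted by decreasing similarity.
    """
    fields = line.split()
    if not fields or fields[0] == SENTENCE_MARKER:
        return None
    return fields[0], parse_candidates(fields[1:])


if __name__ == "__main__":
    print(data_dir_from_config("config=zh-en lang1=zh lang2=en"))
    # Hypothetical sample line, made up for illustration only.
    print(parse_line("猫 cat:0.81 kitten:0.62"))
```

Because the entries are already sorted by similarity, the first pair returned by parse_line is the top-ranked translation candidate for that source word.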