A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this project, we present a matrix co-factorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and induce cross-lingual constraints for simultaneously factorizing the monolingual matrices. The cross-lingual constraints can be derived from parallel corpora, with or without word alignments. Empirical results on a cross-lingual document classification task show that our method effectively encodes cross-lingual knowledge as constraints for learning cross-lingual word embeddings.
Licensed under the Apache License, Version 2.0
This code is a simple implementation of the ACL 2015 paper, and it is based on the GloVe implementation by Jeffrey Pennington.
make
to compile
./vocab-count.out < corpus.txt > vocab.txt
to build the vocabulary (run once for each language); alternatively, you may use a vocabulary obtained elsewhere.
./cooccur-mono.out < corpus.txt > mono.cooc
to count monolingual co-occurrences (run once for each language).
./cooccur-clc-wa.out < corpus.txt > cross.cooc
to count cross-lingual co-occurrences. (Similarly, run ./cooccur-clc+wa.out < A3.final > cross.cooc when word alignments are available. For CLSim, you need to convert t3.final into the same file format produced by cooccur*.out.)
./summarize-real.out -input-file mono.cooc -save-file mono.cooc
to convert each co-occurrence file into sparse format and compute PMI values (run once per co-occurrence file).
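The PMI step above can be sketched as follows. This is a minimal illustration, assuming a hypothetical plain-text input of whitespace-separated "word context count" triples (the files actually produced by cooccur*.out may use a different, binary layout). For each pair, PMI(w, c) = log( count(w, c) * N / (count(w) * count(c)) ), where N is the total co-occurrence mass.

```shell
# Sketch only: compute PMI from "word context count" triples.
# The triple file format is an assumption, not the repository's actual format.
compute_pmi() {
  awk '
    { w[$1] += $3; c[$2] += $3; total += $3; pair[$1 FS $2] = $3 }
    END {
      for (p in pair) {
        split(p, t, FS)
        # PMI = log( joint * total / (marginal_w * marginal_c) )
        pmi = log(pair[p] * total / (w[t[1]] * c[t[2]]))
        printf "%s %s %.4f\n", t[1], t[2], pmi
      }
    }
  ' "$1"
}
```

For example, given the triples "a x 2", "a y 1", "b x 1", the pair (a, x) gets PMI = log(2 * 4 / (3 * 3)), which is slightly negative because the pair co-occurs less often than its marginals predict.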
./mf-bi-clc.out -vocab-file1 vocab1.txt -vocab-file2 vocab2.txt -iter 30 -threads 20 -input-file1 mono1.cooc -input-file2 mono2.cooc -input-file-bi bi.cooc -vector-size 40 -binary 1 -save-file1 vectors1.bin -save-file2 vectors2.bin
to learn cross-lingual embeddings. (Similarly, use ./mf-bi-clsim for CLSim.)
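Putting the steps together, a driver script for one language pair might look like the sketch below. All file names (corpus1.txt, corpus2.txt, parallel.txt) are placeholders for your own data, and the choice to run summarize-real.out on the cross-lingual file as well follows the "each co-occurrence file" instruction above; adjust to your setup.

```shell
#!/bin/sh
# Sketch of the full pipeline; file names are placeholders.
set -e

make  # compile the binaries

# 1. Vocabularies, one per language
./vocab-count.out < corpus1.txt > vocab1.txt
./vocab-count.out < corpus2.txt > vocab2.txt

# 2. Monolingual co-occurrence counts
./cooccur-mono.out < corpus1.txt > mono1.cooc
./cooccur-mono.out < corpus2.txt > mono2.cooc

# 3. Cross-lingual co-occurrence counts from a parallel corpus
./cooccur-clc-wa.out < parallel.txt > bi.cooc

# 4. Sparse format + PMI, once per co-occurrence file
./summarize-real.out -input-file mono1.cooc -save-file mono1.cooc
./summarize-real.out -input-file mono2.cooc -save-file mono2.cooc
./summarize-real.out -input-file bi.cooc -save-file bi.cooc

# 5. Joint factorization of both monolingual matrices
#    under the cross-lingual constraints
./mf-bi-clc.out -vocab-file1 vocab1.txt -vocab-file2 vocab2.txt \
  -iter 30 -threads 20 \
  -input-file1 mono1.cooc -input-file2 mono2.cooc -input-file-bi bi.cooc \
  -vector-size 40 -binary 1 \
  -save-file1 vectors1.bin -save-file2 vectors2.bin
```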
This research is supported by the 973 Program (No. 2014CB340501) and the National Natural Science Foundation of China (NSFC No. 61133012, 61170196 and 61202140).