[1] About WAM

In statistical machine translation (SMT), word alignment plays an important role because a word-aligned bilingual corpus is an excellent source of translation knowledge. The estimation of model parameters usually depends directly on word alignment, not only for phrase-based and hierarchical phrase-based models (Koehn et al., 2003; Chiang, 2007), but also for syntax-based models (Galley et al., 2006; Shen et al., 2008; Quirk et al., 2005; Liu et al., 2006; Huang et al., 2006).

Most SMT systems suffer from "alignment error propagation" because they use only 1-best alignments: alignment mistakes can lead to translation mistakes. To alleviate this problem, we propose a structure called the weighted alignment matrix (WAM), which encodes a probability distribution over exponentially many alignments in linear space. In addition, we develop new algorithms to extract phrases (Koehn et al., 2003) and hierarchical phrases (Chiang, 2007) efficiently from weighted alignment matrices.

This toolkit provides the source code for

* building weighted alignment matrices from a bilingual corpus using GIZA++,
* extracting phrases, and
* extracting hierarchical phrases.
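
To make the idea of a weighted alignment matrix concrete, here is a minimal sketch in Python. It assumes that the weight of a link is the normalized sum of the probabilities of the n-best alignments that contain that link. The function name "build_wam" and the input layout are hypothetical and chosen only for illustration; this is not the code of genWAM-v0.2, which may differ in detail.

```python
def build_wam(nbest, src_len, trg_len):
    """Build a weighted alignment matrix from an n-best list (illustrative sketch).

    nbest   -- list of (probability, links) pairs, where links is a set of
               (source_index, target_index) tuples (hypothetical input layout)
    src_len -- number of source words
    trg_len -- number of target words
    """
    total = sum(prob for prob, _ in nbest)
    if total <= 0:
        raise ValueError("n-best probabilities must sum to a positive value")

    # One cell per (source word, target word) pair, initialised to zero.
    matrix = [[0.0] * trg_len for _ in range(src_len)]

    # Each alignment adds its normalized probability to every link it contains.
    for prob, links in nbest:
        for j, i in links:
            matrix[j][i] += prob / total
    return matrix


# Toy example: a two-word sentence pair with three candidate alignments.
nbest = [
    (0.5, {(0, 0), (1, 1)}),
    (0.3, {(0, 0), (1, 0)}),
    (0.2, {(0, 1), (1, 1)}),
]
wam = build_wam(nbest, src_len=2, trg_len=2)
for row in wam:
    print(" ".join(f"{w:.2f}" for w in row))
```

In this toy example, the link (0, 0) appears in the first two alignments, so its weight is 0.5 + 0.3 = 0.8. A system that used only the 1-best alignment would treat that link as certain and would discard the link (1, 0) entirely.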

[2] Package Organization

The package is organized as follows:

* readme.txt                      // this document
* src                             // source code
  + wam                           // source code for building WAMs
    - build-wam-v0.1              // Perl script for batch processing
    - giza2std-v0.2               // convert "*.A3.finalNBEST" files from GIZA++ to our format
    - symmetrize-v0.1             // get "grow-diag-final" alignments
    - getNBestList-v0.2           // get n-best "grow-diag-final" alignments
    - genWAM-v0.2                 // produce final weighted alignment matrices
  + phrase                        // source code for extracting phrases
    - extract-phrase-v0.4         // Perl script for batch processing
    - buildTTable-v0.3            // get word-to-word bilingual dictionaries
    - extractor-v0.5              // extract phrases
    - merger-v0.2                 // collect fractional counts
    - normalizer-v0.2             // get relative frequency
    - remover-v0.2                // remove alignments
    - reverser-v0.2               // get phrases in the reverse direction
    - scorer-v0.2                 // score phrases
  + hier                          // source code for extracting hierarchical phrases
    - extract-hier-v0.1           // Perl script for batch processing
    - matrixBuildTTable-v0.2      // get word-to-word bilingual dictionaries
    - matrixExtractor-v0.7        // extract hierarchical phrases
    - merger-v0.2                 // collect fractional counts
    - normalizer-v0.2             // get relative frequency
    - reverser-v0.2               // get phrases in the reverse direction
    - genRuleTable-v0.2           // produce final rule table
* example                         // some examples
  + corpus                        // bilingual corpus
    - source                      // source text
    - target                      // target text
  + build_wam
    - wam.txt                     // weighted alignment matrices
  + extract_phrase
    - phraseTable.txt             // phrase table
  + extract_hier
    - ruleTable.txt               // hierarchical phrase table

You need to compile the source code before using it. For simple programs, just use "g++ -O2 -o object *.cpp". For complex programs, we provide makefiles.

The following section shows how to use the toolkit.

[3] Getting Started

Step 1: prepare a bilingual corpus

A bilingual corpus consists of two files. One file contains the source-language text. Each line is a tokenized sentence.
The other file contains the corresponding target-language sentences, which must be the translations of the source sentences, line by line. There are two example files, "source" and "target", in the directory "example/corpus".

Step 2: build weighted alignment matrices

First, download GIZA++, which is freely available at http://code.google.com/p/giza-pp/. Compile it to obtain four executables:

* GIZA++
* mkcls
* plain2snt.out
* snt2cooc.out

Put them anywhere you like. Then, edit lines 11-20 of the Perl script "src/wam/build-wam-v0.1/build-wam-v0.1.pl" so that they point to the required executables. Finally, run the following command:

    ./build-wam-v0.1.pl --src-file source --trg-file target --n 10

The option "--n" sets the number of n-best alignments used to build the matrices (10 in this example). The resulting file "wam.txt" contains the weighted alignment matrices for the bilingual corpus. There is an example file in the directory "example/build_wam/".

Step 3: extract phrases or hierarchical phrases

To extract phrases, edit lines 10-18 of "src/phrase/extract-phrase-v0.4/extract-phrase-v0.4.pl" so that they point to the required executables. Then, run the following command:

    ./extract-phrase-v0.4.pl --src-file source --trg-file target --agt-file wam.txt --rule-table-file phraseTable.txt

Each line in the resulting file "phraseTable.txt" is a phrase pair with four probabilities:

    source phrase ||| target phrase ||| p(s|t) l(s|t) p(t|s) l(t|s)

where "p(*)" denotes relative frequency and "l(*)" denotes lexical weight.

To extract hierarchical phrases, edit lines 10-17 of the Perl script "src/hier/extract-hier-v0.1/extract-hier_v0_1.pl" so that they point to the required executables.
Then, run the following command:

    ./extract-hier_v0_1.pl --src-file source --trg-file target --agt-file wam.txt --rule-table-file ruleTable.txt

Each line in the resulting file "ruleTable.txt" is a hierarchical phrase pair with four probabilities:

    source hierarchical phrase ||| target hierarchical phrase ||| l(t|s) l(s|t) p(t|s) p(s|t)

[4] Acknowledgement

This work was supported by the Microsoft Research Asia Natural Language Processing Theme Program grant (2009-2010).

[5] Reference

The following paper describes the weighted alignment matrix and the related phrase extraction algorithm:

@InProceedings{liu-EtAl:2009:EMNLP3,
  author    = {Liu, Yang and Xia, Tian and Xiao, Xinyan and Liu, Qun},
  title     = {Weighted Alignment Matrices for Statistical Machine Translation},
  booktitle = {Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing},
  month     = {August},
  year      = {2009},
  address   = {Singapore},
  publisher = {Association for Computational Linguistics},
  pages     = {1017--1026},
  url       = {http://www.aclweb.org/anthology/D/D09/D09-1106}
}