In statistical machine translation (SMT), word alignment plays an important role because a word-aligned bilingual corpus is an excellent source of translation knowledge. The estimation of model parameters usually depends directly on word alignment, not only for phrase-based and hierarchical phrase-based models (Koehn et al., 2003; Chiang, 2007), but also for syntax-based models (Galley et al., 2006; Shen et al., 2008; Quirk et al., 2005; Liu et al., 2006; Huang et al., 2006). Most SMT systems suffer from "alignment error propagation" because they use only 1-best alignments: alignment mistakes can lead to translation mistakes. To alleviate this problem, we propose a structure called the weighted alignment matrix, which encodes a probability distribution over exponentially many alignments in linear space. In addition, we develop new algorithms to extract phrases (Koehn et al., 2003) and hierarchical phrases (Chiang, 2007) efficiently from weighted alignment matrices.
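To make the idea of a weighted alignment matrix concrete, here is a minimal sketch in Python. It assumes the matrix is estimated from an n-best list of alignments, with each cell holding the normalized probability mass of the alignments that contain that link. The function name `build_weighted_matrix`, the input layout, and the toy probabilities are all hypothetical and are not taken from the toolkit.

```python
def build_weighted_matrix(nbest, src_len, tgt_len):
    """Estimate link probabilities from an n-best list of alignments.

    nbest is a list of (links, prob) pairs, where links is a set of
    (source_index, target_index) tuples and prob is the (unnormalized)
    probability of that alignment. This layout is an assumption made
    for illustration only.
    """
    total = sum(prob for _, prob in nbest)
    if total <= 0:
        raise ValueError("n-best probabilities must sum to a positive value")

    matrix = [[0.0] * tgt_len for _ in range(src_len)]
    for links, prob in nbest:
        for i, j in links:
            # Each alignment contributes its normalized probability
            # to every link it contains.
            matrix[i][j] += prob / total
    return matrix


# Toy example: three candidate alignments for a 2x2 sentence pair.
nbest = [
    ({(0, 0), (1, 1)}, 0.6),
    ({(0, 0), (1, 0)}, 0.3),
    ({(0, 1), (1, 1)}, 0.1),
]
matrix = build_weighted_matrix(nbest, src_len=2, tgt_len=2)
for row in matrix:
    print(["%.2f" % p for p in row])
```

In this toy case the link (0, 0) appears in two alignments and so receives most of the probability mass, while a link that appears only in the least likely alignment keeps a small but nonzero weight instead of being discarded as it would be under a 1-best alignment.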
This toolkit provides source code for
building weighted alignment matrices from a bilingual corpus using GIZA++,
extracting phrases, and
extracting hierarchical phrases.
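As an illustration of how phrase extraction can work on a weighted matrix rather than a single alignment, the sketch below scores a candidate phrase pair with a fractional count. It assumes links are independent, and it treats a phrase pair as consistent when at least one link falls inside the pair and no link falls in the rows or columns of the pair outside it. This is a simplified sketch under those assumptions; the function name `fractional_count` and the half-open span convention are hypothetical, and the toolkit's actual algorithm may differ.

```python
def fractional_count(matrix, src_span, tgt_span):
    """Probability that a phrase pair is consistent with the alignment.

    matrix[i][j] is the probability that source word i links to target
    word j. Spans are half-open (start, end) index pairs. Links are
    assumed to be independent, which is a simplifying assumption.
    """
    i1, i2 = src_span
    j1, j2 = tgt_span
    no_link_inside = 1.0   # probability that every inside cell is unlinked
    no_link_outside = 1.0  # probability that every outside cell is unlinked

    for i, row in enumerate(matrix):
        for j, p in enumerate(row):
            in_src = i1 <= i < i2
            in_tgt = j1 <= j < j2
            if in_src and in_tgt:
                no_link_inside *= 1.0 - p
            elif in_src or in_tgt:
                # Same row or column as the phrase pair, but outside it:
                # a link here would violate consistency.
                no_link_outside *= 1.0 - p

    return (1.0 - no_link_inside) * no_link_outside


# Toy 2x2 weighted matrix (illustrative values only).
toy_matrix = [
    [0.9, 0.1],
    [0.3, 0.7],
]
single_word_pair = fractional_count(toy_matrix, (0, 1), (0, 1))
whole_sentence_pair = fractional_count(toy_matrix, (0, 2), (0, 2))
print(single_word_pair, whole_sentence_pair)
```

A phrase pair that would be rejected outright under a 1-best alignment can still receive a small positive count here, which is how the matrix softens the effect of alignment errors on extraction.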
This work was supported by a Microsoft Research Asia Natural Language Processing Theme Program grant (2009-2010).