TsinghuaAligner: A Statistical Bilingual Word Alignment System

Note: The size of the manually aligned Chinese-English parallel corpus has been increased from 900 to 40,715 sentence pairs.

Introduction

Word alignment is a natural language processing task that identifies the correspondence between words in two languages. TsinghuaAligner is a statistical bilingual word alignment system developed by the Natural Language Processing Group at Tsinghua University. It takes as input a set of sentence pairs that are translations of each other and produces word alignments automatically. TsinghuaAligner has the following features:

Language independence. The system is language independent and can be used for arbitrary language pairs.
Extensibility. The system is based on log-linear models, which can flexibly incorporate arbitrary information sources that help discover the relationships between natural languages.
Supervised learning. If you would like to teach the system how to align, you may provide the system with manually aligned parallel corpora. Then, the system is able to learn to minimize the difference between its alignments and the manual alignments.
Unsupervised learning. The system can also learn automatically from unlabeled data and still delivers good performance.
Rich structural constraints. TsinghuaAligner supports a variety of structural constraints such as many-to-many, ITG, and block ITG. These constraints prove to be effective in modeling the structural divergence between natural languages.
Link posterior probabilities. The system is capable of producing the posterior probability for each link in alignments to indicate the confidence that two words are aligned.
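
To make the last feature concrete, here is a minimal sketch of how link posteriors fall out of a log-linear alignment model. Under such a model an alignment a for a sentence pair (f, e) has probability proportional to exp(theta . h(f, e, a)), and the posterior of a link is the total probability of all alignments that contain it. Everything in the block is an illustrative assumption: the toy lexicon, the two feature names, the weights, and the brute-force enumeration over the many-to-many alignment space. It is not TsinghuaAligner's code or API, and exhaustive enumeration is only feasible for very short sentences.

```python
import itertools
import math

# Hypothetical toy lexicon and feature weights (assumptions for illustration,
# not values taken from TsinghuaAligner).
LEXICON = {("wo", "I"), ("ai", "love"), ("ni", "you")}
WEIGHTS = {"lexicon_match": 3.0, "link_count": -1.0}


def features(src, tgt, alignment):
    """Feature vector h(f, e, a) for an alignment given as a tuple of (i, j) links."""
    matches = sum((src[i], tgt[j]) in LEXICON for i, j in alignment)
    return {"lexicon_match": matches, "link_count": len(alignment)}


def score(src, tgt, alignment):
    """Log-linear score theta . h(f, e, a)."""
    h = features(src, tgt, alignment)
    return sum(WEIGHTS[name] * value for name, value in h.items())


def link_posteriors(src, tgt):
    """Posterior of each link, by enumerating every subset of candidate links."""
    links = [(i, j) for i in range(len(src)) for j in range(len(tgt))]
    z = 0.0                                  # normalization constant
    mass = {link: 0.0 for link in links}     # unnormalized mass per link
    for size in range(len(links) + 1):
        for alignment in itertools.combinations(links, size):
            weight = math.exp(score(src, tgt, alignment))
            z += weight
            for link in alignment:
                mass[link] += weight
    return {link: m / z for link, m in mass.items()}


posteriors = link_posteriors(["wo", "ai", "ni"], ["I", "love", "you"])
for (i, j), p in sorted(posteriors.items()):
    print("%d-%d  %.4f" % (i, j, p))
```

Because both toy features decompose over individual links and the alignment space is unconstrained, each posterior here reduces to a logistic function of that link's own score. Non-local features or structural constraints such as ITG break this independence, which is when computing posteriors becomes non-trivial.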

Online Demo

Please click here to play with the online demo. The demo currently supports only Chinese and English.

System Requirements

TsinghuaAligner supports Linux i686 and Mac OS X. You need to install the following third-party software to build TsinghuaAligner:

GIZA++. It can be downloaded at https://code.google.com/p/giza-pp/;
g++ version 4.6.3 or higher;
Python version 2.7;
JRE version 1.6 or higher (optional, only for GUI).

User Manual

The user manual describes how to install and use TsinghuaAligner and covers its technical details.

Downloads

The source code and datasets are FREE to download.

Link	Size	Description	Date
TsinghuaAligner.tar.gz	715KB	Source code of the system and example datasets	2015/04/22
model.ce.tar.gz	57MB	Chinese-English model that can be used by TsinghuaAligner	2014/10/07
Chinese-English training set	43MB	Training set (700K sentence pairs from the United Nations and Hong Kong government websites)	2014/10/07
Chinese-English evaluation set	4.7MB	Development and test sets (40,715 sentence pairs with manual annotation)	2018/10/13

Here is a list of institutions that have downloaded TsinghuaAligner for research use (as of 2018/10/13).

Universidade de São Paulo, Brazil
Universidade Federal Fluminense, Brazil
Beijing Normal University, China
Beihang University, China
Beijing University of Posts and Telecommunications, China
Changsha University, China
Chengdu University of Information Technology, China
Dalian University, China
Geely, China
Guangdong University of Foreign Studies, China
Hefei University, China
Institute of Automation, Chinese Academy of Sciences, China
Institute of Computing Technology, Chinese Academy of Sciences, China
Jiangxi Normal University, China
Kunming Institute of Botany, Chinese Academy of Sciences, China
Liulishuo, China
Nanjing Normal University, China
Nanjing University, China
National University of Defense Technology, China
National Cheng Kung University, Taiwan, China
National Taiwan University, Taiwan, China
National Tsing Hua University, Taiwan, China
Peking University, China
Shanghai Normal University, China
ShanghaiTech, China
Shanxi University, China
Sogou Corporation, China
University of Science and Technology Beijing, China
Tencent, China
Xiamen University, China
Xiaomi Corporation, China
Xinjiang University, China
Zhejiang University, China
Zhejiang University of Finance and Economics, China
Lingua Custodia, France
Goethe University, Germany
Universität Frankfurt am Main, Germany
University of Potsdam, Germany
Dublin Institute of Technology, Ireland
National Institute of Information and Communications Technology, Japan
Nippon Telegraph & Telephone, Japan
Toshiba, Japan
Waseda University, Japan
Kazakh-British Technical University, Kazakhstan
Textkernel, Netherlands
Universiteit van Amsterdam, Netherlands
6Estates, Singapore
Institute for Infocomm Research, Singapore
Incheon National University, South Korea
Korea Advanced Institute of Science and Technology, South Korea
Samsung, South Korea
Idiap Research Institute, Switzerland
University of Geneva, Switzerland
Middle East Technical University, Turkey
University of Leeds, United Kingdom
Brigham Young University, United States
Gedanken Labs, United States
Massachusetts Institute of Technology, United States
Hanoi University, Vietnam

History

(2018/10/13) 39,815 new manually aligned Chinese-English sentence pairs were added to the evaluation set.
(2015/04/22) The user manual was improved by adding a subsection for quick start and updating some references.
(2014/12/13) A minor bug in TsinghuaAligner was fixed. The Chinese-English evaluation set was re-annotated by Ms Xiaomin Xue.
(2014/10/31) Online demo and GUI available.
(2014/10/07) Additional datasets provided.
(2014/09/30) Initial release.

References

Yang Liu and Maosong Sun. 2015. Contrastive Unsupervised Word Alignment with Non-Local Features. In Proceedings of AAAI 2015, Austin, Texas, January. [paper][arXiv][slides]

Contact

Yang Liu