A Statistical Bilingual Word Alignment System
Word alignment is a natural language processing task that identifies the correspondence between words in two languages. TsinghuaAligner is a statistical bilingual word alignment system developed by the Natural Language Processing Group at Tsinghua University. It takes as input a set of sentence pairs that are translations of each other and produces word alignments automatically. TsinghuaAligner has the following features:
- Language independence. The system is language independent and can be used for arbitrary language pairs.
- Extensibility. The system is based on log-linear models, which can flexibly incorporate arbitrary information sources that help discover the relationship between natural languages.
- Supervised learning. If you provide manually aligned parallel corpora, the system learns to minimize the difference between its alignments and the manual alignments.
- Unsupervised learning. The system can also learn automatically from unlabeled data and still delivers good performance.
- Rich structural constraints. TsinghuaAligner supports a variety of structural constraints such as many-to-many, ITG, and block ITG. These constraints have proven effective in modeling the structural divergence between natural languages.
- Link posterior probabilities. The system is capable of producing the posterior probability for each link in alignments to indicate the confidence that two words are aligned.
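To make the log-linear model and link posterior features above concrete, here is a minimal Python sketch. It is not TsinghuaAligner's code: the three toy features, the weights, the tiny lexicon, and all function names are assumptions made for illustration, and exhaustive enumeration of candidate alignments stands in for whatever search and learning procedure the real system uses. An alignment is represented as a set of (source index, target index) links.

```python
import math
from itertools import combinations


def features(src, tgt, alignment, lexicon):
    """Toy feature vector for one candidate alignment (hypothetical features)."""
    # Feature 1: sum of log lexical translation probabilities over all links.
    lex = sum(math.log(lexicon.get((src[i], tgt[j]), 1e-6)) for i, j in alignment)
    # Feature 2: number of links.
    link_count = float(len(alignment))
    # Feature 3: number of unaligned words on both sides.
    aligned_src = {i for i, _ in alignment}
    aligned_tgt = {j for _, j in alignment}
    unaligned = float(len(src) - len(aligned_src) + len(tgt) - len(aligned_tgt))
    return [lex, link_count, unaligned]


def score(weights, feats):
    """Log-linear score: dot product of feature weights and feature values."""
    return sum(w * f for w, f in zip(weights, feats))


def all_alignments(n_src, n_tgt):
    """Enumerate every subset of links; feasible only for very short sentences."""
    links = [(i, j) for i in range(n_src) for j in range(n_tgt)]
    for k in range(len(links) + 1):
        for subset in combinations(links, k):
            yield frozenset(subset)


def scored_alignments(src, tgt, weights, lexicon):
    """Pair each candidate alignment with its unnormalized probability exp(score)."""
    return [
        (a, math.exp(score(weights, features(src, tgt, a, lexicon))))
        for a in all_alignments(len(src), len(tgt))
    ]


def best_alignment(src, tgt, weights, lexicon):
    """Return the highest-scoring candidate alignment."""
    return max(scored_alignments(src, tgt, weights, lexicon), key=lambda p: p[1])[0]


def link_posteriors(src, tgt, weights, lexicon):
    """Posterior of a link = total probability of the alignments that contain it."""
    scored = scored_alignments(src, tgt, weights, lexicon)
    z = sum(s for _, s in scored)  # normalization constant
    posteriors = {}
    for alignment, s in scored:
        for link in alignment:
            posteriors[link] = posteriors.get(link, 0.0) + s / z
    return posteriors


if __name__ == "__main__":
    src = ["wo", "ai"]                      # toy source sentence
    tgt = ["I", "love"]                     # toy target sentence
    lexicon = {("wo", "I"): 0.9, ("ai", "love"): 0.8}
    weights = [1.0, 0.0, -2.0]              # hand-picked, not learned
    print(sorted(best_alignment(src, tgt, weights, lexicon)))
    for link, p in sorted(link_posteriors(src, tgt, weights, lexicon).items()):
        print(link, round(p, 4))
```

Adding a new information source in this sketch amounts to appending one more value to the feature vector and one more weight, which is the sense in which log-linear models are extensible. Enumerating all link subsets grows exponentially with sentence length, so this is only workable for toy inputs.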
An online demo is available. The demo currently supports only Chinese and English.
TsinghuaAligner supports Linux i686 and Mac OS X. You need to install the following third-party software to build TsinghuaAligner:
This document describes how to install and use TsinghuaAligner and explains its technical details.
The source code and datasets are FREE to download.
- The package containing the source code of the system and example datasets
- A Chinese-English model that can be used by TsinghuaAligner
- Chinese-English training set: 700K sentence pairs from the United Nations and Hong Kong government websites
- Chinese-English evaluation set: development and test sets (900 sentence pairs with manual annotation)
Here is a list of institutions that downloaded TsinghuaAligner for research use.
- Universidade de São Paulo, Brazil
- Universidade Federal Fluminense, Brazil
- Beijing Normal University, China
- Beihang University, China
- Beijing University of Posts and Telecommunications, China
- Changsha University, China
- Hefei University, China
- Institute of Automation, Chinese Academy of Sciences, China
- Institute of Computing Technology, Chinese Academy of Sciences, China
- Jiangxi Normal University, China
- Kunming Institute of Botany, Chinese Academy of Sciences, China
- Liulishuo, China
- Nanjing Normal University, China
- Nanjing University, China
- National University of Defense Technology, China
- National Taiwan University, Taiwan, China
- National Tsing Hua University, Taiwan, China
- Shanghai Normal University, China
- ShanghaiTech, China
- Xiamen University, China
- Xinjiang University, China
- Lingua Custodia, France
- Goethe University, Germany
- Dublin Institute of Technology, Ireland
- National Institute of Information and Communications Technology, Japan
- Nippon Telegraph & Telephone, Japan
- Toshiba, Japan
- Waseda University, Japan
- Textkernel, Netherlands
- 6Estates, Singapore
- Institute for Infocomm Research, Singapore
- Incheon National University, South Korea
- Samsung, South Korea
- University of Geneva, Switzerland
- Middle East Technical University, Turkey
- University of Leeds, United Kingdom
- Gedanken Labs, United States
- Hanoi University, Vietnam
- (2015/04/22) The user manual was improved by adding a subsection for quick start and updating some references.
- (2014/12/13) A minor bug in TsinghuaAligner was fixed. The Chinese-English evaluation set was re-annotated by Ms Xiaomin Xue.
- (2014/10/31) Online demo and GUI available.
- (2014/10/07) Additional datasets provided.
- (2014/09/30) Initial release.
Yang Liu and Maosong Sun. 2015. Contrastive Unsupervised Word Alignment with Non-Local Features. In Proceedings of AAAI 2015, Austin, Texas, January.