TsinghuaAligner: A Statistical Bilingual Word Alignment System
Note: The size of the manually aligned Chinese-English parallel corpus has been increased from 900 to 40,715 sentence pairs.
Word alignment is a natural language processing task that identifies the correspondence between words in two languages. TsinghuaAligner is a statistical bilingual word alignment system developed by the Natural Language Processing Group at Tsinghua University. It takes as input a set of sentence pairs that are translations of each other and automatically produces word alignments. TsinghuaAligner has the following features:
- Language independence. The system is language independent and can be used for arbitrary language pairs.
- Extensibility. The system is based on log-linear models, which can flexibly incorporate arbitrary information sources that are useful for discovering the relationship between natural languages.
- Supervised learning. If you would like to teach the system how to align, you may provide the system with manually aligned parallel corpora. Then, the system is able to learn to minimize the difference between its alignments and the manual alignments.
- Unsupervised learning. The system can also learn automatically from unlabeled data, without manual alignments, and still delivers good alignment quality.
- Rich structural constraints. TsinghuaAligner supports a variety of structural constraints such as many-to-many, ITG, and block ITG. These constraints have proven effective in modeling the structural divergence between natural languages.
- Link posterior probabilities. The system is capable of producing the posterior probability for each link in alignments to indicate the confidence that two words are aligned.
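To make the log-linear model and link posterior features above more concrete, here is a minimal sketch in Python. It is not TsinghuaAligner's implementation: the three features, the weights in `WEIGHTS`, the toy translation table `TRANS_PROB`, and the function names are all assumptions chosen for illustration. The sketch scores every many-to-many alignment of a tiny sentence pair with a weighted sum of features, then computes each link's posterior as the normalized probability mass of the alignments that contain it. Exhaustive enumeration is only feasible for toy inputs like this one.

```python
import itertools
import math

# Hypothetical feature weights and translation table (illustrative values only).
WEIGHTS = {"log_trans": 1.0, "link_count": 0.0, "unaligned": -1.0}
TRANS_PROB = {
    ("mao", "cat"): 0.9,
    ("shuijiao", "sleeps"): 0.8,
    ("mao", "sleeps"): 0.05,
    ("shuijiao", "cat"): 0.05,
}


def score(alignment, src, tgt, weights, trans_prob):
    """Log-linear score: weighted sum of feature values for one alignment."""
    # Feature 1: summed log translation probability of the linked word pairs.
    log_trans = sum(
        math.log(trans_prob.get((src[i], tgt[j]), 1e-6)) for i, j in alignment
    )
    # Feature 2: number of links in the alignment.
    link_count = len(alignment)
    # Feature 3: number of source and target words left unaligned.
    aligned_src = {i for i, _ in alignment}
    aligned_tgt = {j for _, j in alignment}
    unaligned = (len(src) - len(aligned_src)) + (len(tgt) - len(aligned_tgt))
    features = {
        "log_trans": log_trans,
        "link_count": link_count,
        "unaligned": unaligned,
    }
    return sum(weights[name] * value for name, value in features.items())


def link_posteriors(src, tgt, weights, trans_prob):
    """Posterior of each link under a many-to-many constraint (any link subset)."""
    links = [(i, j) for i in range(len(src)) for j in range(len(tgt))]
    total = 0.0
    mass = {link: 0.0 for link in links}
    for size in range(len(links) + 1):
        for alignment in itertools.combinations(links, size):
            # Unnormalized probability of this alignment.
            p = math.exp(score(alignment, src, tgt, weights, trans_prob))
            total += p
            for link in alignment:
                mass[link] += p
    return {link: m / total for link, m in mass.items()}


if __name__ == "__main__":
    source = ["mao", "shuijiao"]
    target = ["cat", "sleeps"]
    for link, prob in sorted(link_posteriors(source, target, WEIGHTS, TRANS_PROB).items()):
        print(link, round(prob, 3))
```

With these toy values, the links (0, 0) and (1, 1) receive much higher posteriors than the two crossing links, which is the kind of per-link confidence the feature list describes. Adding a new information source in this sketch amounts to adding one entry to `features` and one weight to `WEIGHTS`.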
An online demo is available. It currently supports only Chinese and English.
TsinghuaAligner supports Linux i686 and Mac OS X. You need to install the following third-party software to build TsinghuaAligner:
This document describes how to install and use TsinghuaAligner and the technical details.
The source code and datasets are FREE to download.
- The package contains the source code of the system and example datasets.
- A Chinese-English model that can be used by TsinghuaAligner.
- Chinese-English training set: 700K sentence pairs from the United Nations and Hong Kong government websites.
- Chinese-English evaluation set: development and test sets (40,715 sentence pairs with manual annotation).
Here is a list of institutions that have downloaded TsinghuaAligner for research use (as of 2018/10/13).
- Universidade de São Paulo, Brazil
- Universidade Federal Fluminense, Brazil
- Beijing Normal University, China
- Beihang University, China
- Beijing University of Posts and Telecommunications, China
- Changsha University, China
- Chengdu University of Information Technology, China
- Dalian University, China
- Geely, China
- Guangdong University of Foreign Studies, China
- Hefei University, China
- Institute of Automation, Chinese Academy of Sciences, China
- Institute of Computing Technology, Chinese Academy of Sciences, China
- Jiangxi Normal University, China
- Kunming Institute of Botany, Chinese Academy of Sciences, China
- Liulishuo, China
- Nanjing Normal University, China
- Nanjing University, China
- National University of Defense Technology, China
- National Cheng Kung University, Taiwan, China
- National Taiwan University, Taiwan, China
- National Tsing Hua University, Taiwan, China
- Peking University, China
- Shanghai Normal University, China
- ShanghaiTech, China
- Shanxi University, China
- Sogou Corporation, China
- University of Science and Technology Beijing, China
- Tencent, China
- Xiamen University, China
- Xiaomi Corporation, China
- Xinjiang University, China
- Zhejiang University, China
- Zhejiang University of Finance and Economics, China
- Lingua Custodia, France
- Goethe University, Germany
- Universität Frankfurt am Main, Germany
- University of Potsdam, Germany
- Dublin Institute of Technology, Ireland
- National Institute of Information and Communications Technology, Japan
- Nippon Telegraph & Telephone, Japan
- Toshiba, Japan
- Waseda University, Japan
- Kazakh-British Technical University, Kazakhstan
- Textkernel, Netherlands
- Universiteit van Amsterdam, Netherlands
- 6Estates, Singapore
- Institute for Infocomm Research, Singapore
- Incheon National University, South Korea
- Korea Advanced Institute of Science and Technology, South Korea
- Samsung, South Korea
- Idiap Research Institute, Switzerland
- University of Geneva, Switzerland
- Middle East Technical University, Turkey
- University of Leeds, United Kingdom
- Brigham Young University, United States
- Gedanken Labs, United States
- Massachusetts Institute of Technology, United States
- Hanoi University, Vietnam
- (2018/10/13) 39,815 new manually aligned Chinese-English sentence pairs were added to the evaluation set.
- (2015/04/22) The user manual was improved by adding a subsection for quick start and updating some references.
- (2014/12/13) A minor bug in TsinghuaAligner was fixed. The Chinese-English evaluation set was re-annotated by Ms Xiaomin Xue.
- (2014/10/31) Online demo and GUI available.
- (2014/10/07) Additional datasets provided.
- (2014/09/30) Initial release.
Yang Liu and Maosong Sun. 2015. Contrastive Unsupervised Word Alignment with Non-Local Features. In Proceedings of AAAI 2015, Austin, Texas, January.