The Tsinghua NLP (thunlp) Group is devoted to making its NLP algorithms and methods available to everyone, for use in Chinese NLP, knowledge graphs, and social computing. The code is produced by members of the thunlp Lab, headed by Prof. Maosong Sun and Associate Prof. Zhiyuan Liu.
Highlighted Packages
- THULAC: An Efficient Lexical Analyzer for Chinese. [home][Git C++][Git Java][Git Python]
- THUCTC: An Efficient Chinese Text Classifier. [home][Git Java]
- THUOCL: Open Chinese Lexicon. [home]
- OpenKE: An Open-Source Package for Knowledge Embedding (KE). [home][Git]
- OpenNE: An Open-Source Package for Network Embedding (NE). [Git]
Knowledge Graph and Relation Extraction
- NRE: An Open-Source Package for Neural Relation Extraction. [Git][TensorFlow Version]
Neural relation extraction aims to extract relations from plain text with neural models, which are the state-of-the-art methods for relation extraction. In this package, we provide our implementations of CNN [Zeng et al., 2014] and PCNN [Zeng et al., 2015], together with their extended versions that use a sentence-level attention scheme [Lin et al., 2016].
- JointNRE: Joint Neural Relation Extraction with Text and KGs. [Git]
This is the lab code of our AAAI 2018 paper "Neural Knowledge Acquisition via Mutual Attention between Knowledge Graph and Text".
- PathNRE: Neural Relation Extraction with Relation Paths. [Git]
This is the lab code of our EMNLP 2017 paper "Incorporating Relation Paths in Neural Relation Extraction".
- Neural Entity Alignment. [Git]
This is the lab code of our IJCAI 2017 paper "Iterative Entity Alignment via Joint Knowledge Embeddings".
- Neural Entity Typing. [Git]
This is the lab code of our AAAI 2018 paper "Improving Neural Fine-Grained Entity Typing with Knowledge Attention".
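The sentence-level attention scheme mentioned in the NRE entry can be illustrated with a small sketch. It assumes each sentence in a bag (all sentences mentioning the same entity pair) is already encoded as a vector, and it scores each sentence with a plain dot product against a relation query vector. That scoring function is a simplification chosen for brevity, and the function names and toy vectors are hypothetical; this is not the code of the NRE package.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend_over_bag(sentence_vecs, relation_query):
    """Weight each sentence encoding by its match with the relation query.

    Returns (weights, bag_vector), where bag_vector is the attention-weighted
    sum of the sentence encodings. Dot-product scoring is an assumption made
    for this sketch.
    """
    scores = [dot(s, relation_query) for s in sentence_vecs]
    weights = softmax(scores)
    dim = len(sentence_vecs[0])
    bag_vector = [
        sum(w * s[i] for w, s in zip(weights, sentence_vecs))
        for i in range(dim)
    ]
    return weights, bag_vector


# Toy bag of three 2-dimensional sentence encodings (hypothetical values).
bag = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
weights, bag_vector = attend_over_bag(bag, query)
```

Sentences that match the query poorly receive small weights, which is how attention of this kind can down-weight noisy sentences in a bag.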
Knowledge Representation Learning
- OpenKE: An Open-Source Package for Knowledge Embedding (KE). [Git]
- KRLPapers: Must-read papers on knowledge representation learning (KRL) / knowledge embedding (KE). [Git]
- TransX: An efficient implementation of TransE and its extended models for Knowledge Representation Learning. [Git][TensorFlow Version]
- KB2E: A package of Knowledge Base to Embeddings. [Git]
The package contains state-of-the-art knowledge representation learning methods including TransE, TransH, TransR and PTransE.
- KR-EAR: Knowledge Representation Learning with Entities, Attributes and Relations. [Git]
This is the lab code of our IJCAI 2016 paper "Knowledge Representation Learning with Entities, Attributes and Relations".
- CKRL: Confidence-aware Knowledge Representation Learning. [Git]
This is the lab code of our AAAI 2018 paper "Does William Shakespeare REALLY Write Hamlet? Knowledge Representation Learning with Confidence". The method is expected to support robust knowledge representation learning with noisy triples.
- IKRL: Image-embodied Knowledge Representation Learning. [Git]
This is the lab code of our IJCAI 2017 paper "Image-embodied Knowledge Representation Learning". The method is expected to support knowledge representation learning with entity images.
- TKRL: Type-embodied Knowledge Representation Learning. [Git]
This is the lab code of our IJCAI 2016 paper "Representation Learning of Knowledge Graphs with Hierarchical Types". The method is expected to support knowledge representation learning with hierarchical types of entities.
- DKRL: Description-embodied Knowledge Representation Learning. [Git]
This is the lab code of our AAAI 2016 paper "Representation Learning of Knowledge Graphs with Entity Descriptions". The method is expected to support knowledge representation learning with entity descriptions.
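Several packages in this section build on TransE, which models a relation as a translation in the embedding space, so that head + relation is close to tail for a true triple. The sketch below shows the distance score and a margin-based ranking loss for one positive and one corrupted triple. The function names and toy vectors are hypothetical, and this is an illustration of the idea rather than the code of any package listed here.

```python
def transe_distance(head, relation, tail, norm=1):
    """Distance ||head + relation - tail|| under the L1 or L2 norm.

    A small distance means the triple is considered plausible.
    """
    diffs = [abs(h + r - t) for h, r, t in zip(head, relation, tail)]
    if norm == 1:
        return sum(diffs)
    return sum(d * d for d in diffs) ** 0.5


def margin_ranking_loss(pos_distance, neg_distance, margin=1.0):
    """Hinge loss that pushes corrupted triples at least `margin` further away."""
    return max(0.0, margin + pos_distance - neg_distance)


# Toy 2-dimensional embeddings (hypothetical values).
head = [1.0, 0.0]
relation = [0.0, 1.0]
true_tail = [1.0, 1.0]       # head + relation equals this tail exactly
corrupted_tail = [0.0, 0.0]  # a randomly replaced tail

pos = transe_distance(head, relation, true_tail)
neg = transe_distance(head, relation, corrupted_tail)
loss = margin_ranking_loss(pos, neg)
```

In training, the embeddings would be updated by gradient descent on this loss over many sampled triples; that loop is omitted here.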
Network Representation Learning
- OpenNE: An Open-Source Package for Network Embedding (NE). [Git]
- NRLPapers: Must-read papers on network representation learning (NRL) / network embedding (NE). [Git]
- TransNet: Translation-Based Network Representation Learning. [Git]
This is the lab code of our IJCAI 2017 paper "TransNet: Translation-Based Network Representation Learning for Social Relation Extraction". The method is expected to model social networks by treating relations as translations between vertices.
- NEU: Fast Network Embedding. [Git]
This is the lab code of our IJCAI 2017 paper "Fast Network Embedding Enhancement via High Order Proximity Approximation". The method is expected to speed up network embedding with an approximate update algorithm.
- CANE: Context-Aware Network Embedding. [Git]
This is the lab code of our ACL 2017 paper "CANE: Context-Aware Network Embedding for Relation Modeling". The method is expected to support context-aware network representation learning and model asymmetric relations.
- MMDW: Max-Margin DeepWalk. [Git]
This is the lab code of our IJCAI 2016 paper "Max-Margin DeepWalk: Discriminative Learning of Network Representation". The method is expected to support discriminative network representation learning with node labels.
- TADW: Text-Associated DeepWalk. [Git]
This is the lab code of our IJCAI 2015 paper "Network Representation Learning with Rich Text Information". The method is expected to support network representation learning with rich text information within each node. The code requires a 64-bit Linux machine with MATLAB installed.
Sememe-Driven NLP
- SE-WRL: Improved Word Representation Learning with Sememes. [Git]
This is the lab code of our ACL 2017 paper "Improved Word Representation Learning with Sememes". Sememes are the minimum semantic units of word meanings, and the meaning of each word sense is typically composed of several sememes. We propose an improved word representation learning method that uses the sememe knowledge annotated in HowNet.
- Lexical Sememe Prediction. [Git]
This is the lab code of our IJCAI 2017 paper "Lexical Sememe Prediction via Word Embeddings and Matrix Factorization".
- Chinese LIWC Lexicon Expansion. [Git]
This is the lab code of our AAAI 2018 paper "Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention".
Language Representation Learning
- CWE: Character Word Embeddings. [Git]
This is the lab code of our IJCAI 2015 paper "Joint Learning of Character and Word Embeddings". This method is expected to learn Chinese word embeddings by taking the characters within words into consideration. A Chinese analogical reasoning dataset is available in the data folder.
- CLWE: Cross-Lingual Word Embeddings. [home]
This is the lab code of our ACL 2015 short paper "Learning Cross-lingual Word Embeddings via Matrix Co-factorization". This method is expected to learn cross-lingual word embeddings with a matrix co-factorization framework.
- OIWE: Online Interpretable Word Embeddings. [Git]
This is the lab code of our EMNLP 2015 short paper "Online Learning of Interpretable Word Embeddings". This method is expected to learn interpretable word embeddings based on the OIWE-IPG model proposed in our paper.
- TWE: Topical Word Embeddings. [Git]
This is the lab code of our AAAI 2015 paper "Topical Word Embeddings". The method is expected to learn word representations together with their topic assignments, which are obtained from latent topic models such as Latent Dirichlet Allocation.
General NLP
- THUCKE: An Open-Source Package for Chinese Keyphrase Extraction. [Git]
The package can efficiently extract Chinese keyphrases by translating from documents to keyphrases, using the word alignment models (WAM) that we proposed in [EMNLP][CoNLL].
- TensorFlow-Summarization: An Open-Source Package for Neural Headline Generation. [Git]
This is an implementation of a sequence-to-sequence model using a bidirectional GRU encoder and a GRU decoder. The project aims to help people start working on abstractive short text summarization immediately. It may also work on machine translation tasks.
- THUNSC: An Open-Source Package for Neural Sentiment Classification. [Git]
Neural sentiment classification aims to classify the sentiment of a document with neural models, which are the state-of-the-art methods for sentiment classification. In this package, we provide our implementations of NSC, NSC+LA and NSC+UPA [Chen et al., 2016], in which user and product information is considered via attention over different semantic levels.
- THUTAG: An Open-Source Package for Keyphrase Extraction and Social Tag Suggestion. [Git]
The package contains several keyphrase extraction methods, including TextRank, ExpandRank, Topical PageRank and WAM, and social tag suggestion methods, including KNN, PMI, TagLDA, TAM and WTM. The package has supported one of the most popular microblog apps, Weibo Keywords, which has more than 3.5 million registered users.
- PLDA+: An Open-Source Package for Parallel LDA. [Git]
PLDA is a parallel C++ implementation of Latent Dirichlet Allocation (LDA). We present a highly optimized parallel implementation of the Gibbs sampling algorithm for the training and inference of LDA. The carefully designed architecture is expected to support extensions of this algorithm. PLDA+, an enhanced parallel implementation of LDA, further improves the scalability of LDA by significantly reducing the unparallelizable communication bottleneck and achieving good load balancing.
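For readers unfamiliar with the Gibbs sampling algorithm that PLDA parallelizes, the following is a minimal single-process sketch of collapsed Gibbs sampling for LDA. It is written in Python for readability and is not PLDA's C++ code; the function name, the hyperparameter defaults and the toy corpus are all assumptions made for illustration, and none of the parallelization or load balancing described above is shown.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as word-id lists.

    Returns (assignments, doc_topic, topic_word), where the last two are
    count tables: topics per document and words per topic.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return assignments, doc_topic, topic_word


# Toy corpus: three documents over a 4-word vocabulary (hypothetical data).
corpus = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
assignments, doc_topic, topic_word = lda_gibbs(corpus, num_topics=2, vocab_size=4)
```

Each token update reads and writes the shared topic-word counts, which is why parallel versions need to communicate those counts between workers.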
Last update: 22 Mar, 2018.