《计算机应用研究》|Application Research of Computers

一种基于双向LSTM的联合学习的中文分词方法

Joint learning method based on BLSTM for Chinese word segmentation

Authors Zhang Dengyi (章登义), Hu Si (胡思), Xu Aiping (徐爱萍)
Affiliation School of Computer, Wuhan University, Wuhan 430072
Article ID 1001-3695(2019)10-008-2920-05
DOI 10.19734/j.issn.1001-3695.2018.03.0239
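Abstract Existing deep-learning neural network models for Chinese word segmentation are usually trained on a single corpus. To address this, the paper proposes a Chinese word segmentation method based on large-scale joint learning over multiple corpora. The corpora comprise simplified Chinese data sets (PKU, MSRA, CTB6) and traditional Chinese data sets (CITYU, AS); a pair of identifiers is added at the beginning and end of each input sentence of every data set. BLSTM (bi-directional long short-term memory) and CRF (conditional random field) models are applied in experiments that train on each data set separately and on all corpora jointly. The results show that large-scale joint training over multiple corpora achieves good segmentation performance.
Keywords Chinese word segmentation; large-scale corpora; joint learning; bi-directional long short-term memory model
Funding National Key Research and Development Program of China (2017YFC0803700)
Article URL http://www.arocmag.com/article/01-2019-10-008.html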
English title Joint learning method based on BLSTM for Chinese word segmentation
Author names (English) Zhang Dengyi, Hu Si, Xu Aiping
Affiliation (English) School of Computer, Wuhan University, Wuhan 430072, China
English abstract Existing deep-learning neural network models are usually trained on a single-criterion corpus. This paper proposes a joint learning method for large-scale corpora based on a bi-directional long short-term memory (BLSTM) neural network and conditional random fields (CRF). The corpora comprise simplified Chinese data sets (PKU, MSRA, CTB6) and traditional Chinese data sets (CITYU, AS). The method adds a pair of identifiers to the beginning and end of each input sentence of each data set. Experimental results show that the proposed method performs well on Chinese word segmentation for such large-scale corpora.
English keywords Chinese word segmentation; large-scale corpora; joint learning; bi-directional long short-term memory neural network model
Received 2018/3/16
Revised 2018/6/12
Pages 2920-2924
CLC number TP391.1
Document code A
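
The abstract states that a pair of identifiers is added to the beginning and end of every input sentence so that several corpora can be trained together. A minimal Python sketch of that preprocessing step follows. It is an illustration under assumptions, not the authors' implementation: the abstract does not give the identifier format, so the `<pku>` ... `</pku>` spelling, the choice of one marker pair per corpus, and the function names `add_corpus_markers` and `build_joint_training_set` are all hypothetical.

```python
# Hypothetical marker format: the abstract only says "a pair of identifiers";
# the <name> ... </name> spelling and per-corpus markers are assumptions.
CORPORA = ("pku", "msra", "ctb6", "cityu", "as")


def add_corpus_markers(sentence, corpus):
    """Wrap a sentence's characters with begin/end markers naming its corpus."""
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus}")
    return [f"<{corpus}>"] + list(sentence) + [f"</{corpus}>"]


def build_joint_training_set(datasets):
    """Merge per-corpus sentences into one list of marked character sequences."""
    joint = []
    for corpus, sentences in datasets.items():
        for sentence in sentences:
            joint.append(add_corpus_markers(sentence, corpus))
    return joint


# Toy input: one simplified Chinese sentence and one traditional Chinese sentence.
datasets = {
    "pku": ["北京大学"],
    "as": ["中央研究院"],
}
joint = build_joint_training_set(datasets)
```

Each marked sequence would then be fed to a single shared tagger (BLSTM with a CRF layer in the paper), with the markers indicating which corpus the sentence came from.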