《计算机应用研究》|Application Research of Computers

Sequence labeling Chinese word segmentation method based on LSTM networks

Authors: Ren Zhihui, Xu Haoyu, Feng Songlin, Zhou Han, Shi Jun
Affiliations: 1. School of Communication & Information Engineering, Shanghai University, Shanghai 200444, China; 2. Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China; 3. University of Chinese Academy of Sciences, Beijing 100049, China
Article ID: 1001-3695(2017)05-1321-04
DOI: 10.3969/j.issn.1001-3695.2017.05.009
Abstract: The current mainstream methods for Chinese word segmentation are character-tagging approaches based on traditional machine learning, which require manually configuring and extracting features from Chinese text and suffer from high-dimensional feature dictionaries and long training times when models are trained on CPUs. To address these problems, this paper proposes an improved method based on the long short-term memory (LSTM) network model, which performs Chinese word segmentation using different word-position tag sets together with pre-trained character embeddings. Comparative experiments on corpora commonly used for Chinese word segmentation evaluation show that the LSTM-based method outperforms current traditional machine learning methods; that the six-tag set combined with pre-trained character embeddings yields the best segmentation performance; and that GPUs can greatly reduce the training time of the deep neural network model. The LSTM-based method also generalizes readily to other sequence labeling tasks in natural language processing (NLP).
Keywords: Chinese word segmentation; LSTM; character embedding; NLP
Funding: National Natural Science Foundation of China (61471231); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06010301)
URL: http://www.arocmag.com/article/01-2017-05-009.html
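The character-tagging formulation described in the abstract reduces segmentation to labeling each character with its position in a word. The sketch below shows two common word-position tag inventories for this task: the standard 4-tag BMES set and a 6-tag set that adds B2/B3 for the second and third characters of long words (function names are illustrative, not from the paper's code):

```python
# Convert a pre-segmented sentence into per-character word-position tags.
#   4-tag (BMES): B=begin, M=middle, E=end, S=single-character word
#   6-tag:        adds B2/B3 for the 2nd/3rd characters of longer words

def tag_word_4(word: str) -> list[str]:
    """BMES tags for one word."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

def tag_word_6(word: str) -> list[str]:
    """Six-tag labels: S / B E / B B2 E / B B2 B3 E / B B2 B3 M ... E."""
    n = len(word)
    if n == 1:
        return ["S"]
    # The first up-to-three non-final characters get B, B2, B3;
    # any remaining interior characters get M; the last character gets E.
    return ["B", "B2", "B3"][: n - 1] + ["M"] * max(0, n - 4) + ["E"]

def tag_sentence(words: list[str], tag_word) -> tuple[str, list[str]]:
    """Flatten a segmented sentence into (character string, tag sequence)."""
    chars = "".join(words)
    tags = [t for w in words for t in tag_word(w)]
    return chars, tags

# Example: the segmented sentence 自然语言 / 处理 / 很 / 有趣
words = ["自然语言", "处理", "很", "有趣"]
chars, tags4 = tag_sentence(words, tag_word_4)
_, tags6 = tag_sentence(words, tag_word_6)
# tags4 → ['B', 'M', 'M', 'E', 'B', 'E', 'S', 'B', 'E']
# tags6 → ['B', 'B2', 'B3', 'E', 'B', 'E', 'S', 'B', 'E']
```

With labels in hand, training the tagger is a per-character classification problem: each character's (pre-trained) embedding is fed to an LSTM, whose hidden state at that position is mapped to a distribution over the tag set.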
Received: 2016-03-25
Revised: 2016-05-25
Pages: 1321-1324, 1341
CLC number: TP391.1
Document code: A