《计算机应用研究》|Application Research of Computers

基于字簇的多模型中文分词方法研究

Multi-model Chinese word segmentation method based on character clusters

Authors 李对红, 王裴岩, 张桂平, 张少阳
Affiliation 沈阳航空航天大学 人机智能研究中心, 沈阳 110136
Article ID 1001-3695(2020)02-008-0355-05
DOI 10.19734/j.issn.1001-3695.2018.08.0540
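Abstract Character-based tagging is one of the more effective approaches in current Chinese word segmentation. However, Chinese characters carry semantic information of their own, and different characters have different meanings and functions in different contexts, so the word-formation rules of individual characters differ. To address this problem, the paper proposes a multi-model Chinese word segmentation method based on character clusters: it first builds a model for each character, then runs a cluster analysis on the learned model parameters to form character clusters, and finally retrains the model parameters on the basis of those clusters. Experimental results show that the method effectively discovers clusters of characters with the same or similar word-formation rules and clearly distinguishes how strongly the same type of feature affects different characters.
Keywords Chinese word segmentation; word-formation rules; model parameters; clustering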
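Funding Key Program of the Natural Science Foundation of Liaoning Province (20170540705)
Humanities and Social Sciences Youth Foundation of the Ministry of Education of China (17YJC740087)
Article URL http://www.arocmag.com/article/01-2020-02-008.html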
English title Multi-model Chinese word segmentation method based on character clusters
Authors (English) Li Duihong, Wang Peiyan, Zhang Guiping, Zhang Shaoyang
Affiliation (English) Human-Computer Intelligence Research Center, Shenyang Aerospace University, Shenyang 110136, China
English abstract Character-based tagging is currently one of the more effective approaches to Chinese word segmentation. However, Chinese characters carry semantic information of their own, and different characters have different meanings and functions in different contexts, so they correlate with their context differently and each character follows its own word-formation rules. To address this problem, this paper proposes a multi-model segmentation method based on character clusters. The method first builds a separate model for each character, then clusters the learned model parameters to form character clusters, and finally retrains the model parameters on the basis of those clusters. Experimental results show that the method effectively finds clusters of characters with the same or similar word-formation rules and distinguishes how strongly the same type of feature affects different characters.
English keywords Chinese word segmentation; word-formation rules; model parameters; clustering
Received 2018/8/6
Revised 2018/10/8
Pages 355-359,374
CLC number TP391
Document code A
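
The three steps named in the abstract (one model per character, clustering of the learned parameters, retraining per cluster) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration, since the abstract does not specify it: the B/M/E/S tag set, the previous/next-character features, the multi-class perceptron used as the per-character model, the greedy "leader" clustering with a cosine threshold, and all function names. It is a toy under those assumptions, not the authors' implementation.

```python
from collections import defaultdict

TAGS = "BMES"  # assumed tag set: Begin, Middle, End, Single

def tag_words(words):
    """Turn a segmented sentence into parallel lists of characters and B/M/E/S tags."""
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        tags.extend("S" if len(w) == 1 else "B" + "M" * (len(w) - 2) + "E")
    return chars, tags

def features(chars, i):
    """Context features only; the current character selects which model is used."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return ["bias", "prev=" + prev_c, "next=" + next_c]

def predict(w, feats):
    """Highest-scoring tag; ties resolve to the earliest tag in TAGS."""
    return max(TAGS, key=lambda t: sum(w.get((f, t), 0.0) for f in feats))

def train_perceptron(samples, epochs=5):
    """Multi-class perceptron (a stand-in model) over (feature_list, tag) pairs."""
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in samples:
            pred = predict(w, feats)
            if pred != gold:
                for f in feats:
                    w[(f, gold)] += 1.0
                    w[(f, pred)] -= 1.0
    return dict(w)

def cosine(a, b):
    """Cosine similarity between two sparse weight dicts (0.0 if either is all zero)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def leader_cluster(models, threshold=0.5):
    """Greedy clustering: join the first cluster whose leader is similar enough."""
    leaders, assign = [], {}
    for ch in sorted(models):
        for cid, leader in enumerate(leaders):
            if cosine(models[ch], models[leader]) >= threshold:
                assign[ch] = cid
                break
        else:
            assign[ch] = len(leaders)
            leaders.append(ch)
    return assign

def fit(corpus, threshold=0.5):
    """Step 1: per-character models. Step 2: cluster them. Step 3: retrain per cluster."""
    by_char = defaultdict(list)
    for words in corpus:
        chars, tags = tag_words(words)
        for i, (c, t) in enumerate(zip(chars, tags)):
            by_char[c].append((features(chars, i), t))
    char_models = {c: train_perceptron(s) for c, s in by_char.items()}
    assign = leader_cluster(char_models, threshold)
    by_cluster = defaultdict(list)
    for c, samples in by_char.items():
        by_cluster[assign[c]].extend(samples)
    cluster_models = {cid: train_perceptron(s) for cid, s in by_cluster.items()}
    return assign, cluster_models

def segment_tags(sentence, assign, cluster_models):
    """Tag each character with its cluster's model; unseen characters get an empty model."""
    chars = list(sentence)
    return [predict(cluster_models.get(assign.get(c), {}), features(chars, i))
            for i, c in enumerate(chars)]

# Toy corpus, invented for illustration only.
toy_corpus = [["我们", "喜欢", "学习"], ["他们", "学习", "中文"], ["我", "喜欢", "中文"]]
toy_assign, toy_models = fit(toy_corpus)
print(toy_assign)
print(segment_tags("我们学习中文", toy_assign, toy_models))
```

Because each cluster pools the training samples of all its member characters, characters with similar learned weights share one retrained model, while dissimilar characters keep separate parameters. A real system would presumably use a sequence model and a richer feature set; the threshold of 0.5 is arbitrary.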