《计算机应用研究》|Application Research of Computers

KEC: Chinese patent keyword extraction method based on cw2vec

Authors Tan Tingting, Chen Gaorong, Xu Jian
Affiliation School of Computer Science & Engineering, Nanjing University of Science & Technology, Nanjing 210094, China
Article ID 1001-3695(2020)10-005-2907-05
DOI 10.19734/j.issn.1001-3695.2019.06.0203
Abstract Keyword extraction is a prerequisite for many text mining tasks, and its accuracy directly affects the performance of downstream tasks. Taking Chinese patents as the research object and considering the characteristics of patent text, this paper recasts keyword extraction as a word vector clustering problem and proposes KEC (keyword extraction based on cw2vec), a keyword extraction method based on cw2vec word vectors. The method first builds a domain dictionary from the keywords of scientific and technical literature and from open-source dictionaries. It then preprocesses the patent text with the domain dictionary to obtain candidate keywords and trains a cw2vec model to obtain vector representations of those candidates. Finally, it applies a clustering algorithm to extract the final keywords. Experiments on a real patent dataset show that KEC outperforms other existing word-clustering-based keyword extraction methods in precision, recall, and F1-measure.
Keywords Chinese patent; word vector; keyword extraction; word clustering
Funding National Natural Science Foundation of China (61872186, 61802205)
URL http://www.arocmag.com/article/01-2020-10-005.html
Received 2019/6/24
Revised 2019/7/31
Pages 2907-2911,2916
CLC number TP391
Document code A
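
The final step described in the abstract, clustering the candidate keyword vectors and reading keywords off the clusters, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the abstract names only "a clustering algorithm", so plain k-means stands in for it; picking the word nearest each centroid is one plausible selection rule; and `toy_vectors` is a hypothetical set of 2-D vectors standing in for cw2vec embeddings, which are assumed to have been trained separately.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, iters=20):
    """Plain k-means (assumed stand-in for the paper's clustering step)."""
    # Simplification: initialise centroids with the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            idx = min(range(k), key=lambda i: dist(v, centroids[i]))
            clusters[idx].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def extract_keywords(word_vectors, k):
    """Return, for each cluster, the candidate word nearest its centroid."""
    words = list(word_vectors)
    centroids = kmeans([word_vectors[w] for w in words], k)
    keywords = []
    for c in centroids:
        best = min(words, key=lambda w: dist(word_vectors[w], c))
        if best not in keywords:
            keywords.append(best)
    return keywords


# Hypothetical candidate keywords with toy 2-D vectors (not real cw2vec output).
toy_vectors = {
    "battery": [0.9, 0.1],
    "electrode": [1.0, 0.0],
    "anode": [0.8, 0.2],
    "sensor": [0.1, 0.9],
    "detector": [0.0, 1.0],
    "probe": [0.2, 0.8],
}

print(extract_keywords(toy_vectors, k=2))
```

In a real pipeline the candidate words would come from dictionary-guided segmentation of the patent text, and the number of clusters would be tuned to the desired number of keywords per document.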