Application Research of Computers (《计算机应用研究》)

Semantic-based keyword extraction method for document (original title: 基于语义的文档关键词提取方法)

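Authors JIANG Fang (姜芳), LI Guo-he (李国和), YUE Xiang (岳翔)
Affiliations 1. Beijing Key Laboratory of Data Mining for Petroleum Data, College of Geophysics and Information Engineering, China University of Petroleum (Beijing), Beijing 102249; 2. Information and Data Center, CNOOC Research Institute, Beijing 100029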
Article ID 1001-3695(2015)01-0142-04
DOI 10.3969/j.issn.1001-3695.2015.01.032
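Abstract Extracting document keywords on a semantic basis is an effective way to improve the accuracy of automatic extraction. Taking Chinese documents as the processing target, the method computes semantic distances between words using Tongyici Cilin (《同义词词林》, a Chinese synonym thesaurus), applies density clustering to the words to obtain topic-related classes, and selects the central words of those classes as keywords. A statistical experiment and a scoring experiment show that the semantic-based keyword extraction method achieves relatively high precision and recall, and that the extracted keywords are highly relevant to the document topic.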
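The pipeline named in the abstract (thesaurus-based semantic distance, density clustering, central-word selection) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy English vocabulary and its Cilin-style codes in `TOY_CODES`, the level-based distance formula, the DBSCAN-style clustering with the hypothetical parameters `eps` and `min_pts`, and the choice of the cluster medoid as the "central word".

```python
# Hypothetical toy thesaurus: word -> Cilin-style hierarchical code.
# The five code segments stand for major class, medium class, minor class,
# word group and atomic word group. The entries are invented for this sketch.
TOY_CODES = {
    "oil": "Ba01A01",
    "petroleum": "Ba01A01",
    "gas": "Ba01A02",
    "fuel": "Ba01B01",
    "cluster": "Dd05A01",
    "group": "Dd05A02",
    "banana": "Bh07C03",
}

# Character ranges of the five hierarchy levels inside a code.
LEVEL_SLICES = [(0, 1), (1, 2), (2, 4), (4, 5), (5, 7)]


def semantic_distance(a, b, codes=TOY_CODES):
    """Assumed distance in [0, 1]: the earlier two codes diverge, the larger it is."""
    ca, cb = codes[a], codes[b]
    shared = 0
    for start, end in LEVEL_SLICES:
        if ca[start:end] != cb[start:end]:
            break
        shared += 1
    return (len(LEVEL_SLICES) - shared) / len(LEVEL_SLICES)


def density_cluster(words, eps=0.4, min_pts=2):
    """DBSCAN-style clustering over semantic distances; noise words are dropped."""
    def neighbours(w):
        # A word counts as its own neighbour (distance 0).
        return [v for v in words if semantic_distance(w, v) <= eps]

    labels = {}  # word -> cluster id, or -1 for noise
    cluster_id = 0
    for w in words:
        if w in labels:
            continue
        seeds = neighbours(w)
        if len(seeds) < min_pts:
            labels[w] = -1  # provisionally noise; may join a cluster later
            continue
        labels[w] = cluster_id
        queue = [v for v in seeds if v != w]
        while queue:
            v = queue.pop()
            if labels.get(v, -1) != -1:
                continue  # already assigned to a cluster
            labels[v] = cluster_id
            more = neighbours(v)
            if len(more) >= min_pts:  # v is a core word: keep expanding
                queue.extend(more)
        cluster_id += 1

    clusters = {}
    for w, c in labels.items():
        if c != -1:
            clusters.setdefault(c, []).append(w)
    return list(clusters.values())


def central_word(cluster):
    """Medoid: the word with the smallest total distance to its cluster mates."""
    return min(
        cluster,
        key=lambda w: (sum(semantic_distance(w, v) for v in cluster), w),
    )


def extract_keywords(words):
    """One keyword per topic-related class."""
    return [central_word(c) for c in density_cluster(words)]


print(extract_keywords(list(TOY_CODES)))  # ['oil', 'cluster']
```

In this toy run "banana" has no close neighbour and is treated as noise, while the two dense groups each contribute one keyword. Ties between equally central words are broken alphabetically, which is a convenience of the sketch rather than a rule from the paper.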
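Keywords semantic distance; density clustering; keyword extraction
Funding National "863" Program of China (2009AA062802)
National Natural Science Foundation of China (60473125)
CNPC Petroleum Science and Technology Innovation Fund for Young and Middle-aged Researchers (05E7013)
Sub-project of the National Major Special Project (G5800-08-ZS-WX)
Article URL http://www.arocmag.com/article/01-2015-01-032.html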
English title Semantic-based keyword extraction method for document
Author names (English) JIANG Fang, LI Guo-he, YUE Xiang
Affiliations (English) 1. Beijing Key Laboratory of Data Mining for Petroleum Data, College of Geophysics & Information Engineering, China University of Petroleum, Beijing 102249, China; 2. Information & Data Center, CNOOC Research Institute, Beijing 100029, China
English abstract Document keyword extraction on the basis of semantics is an effective way to improve the accuracy of automatic extraction. This paper takes Chinese documents as the processing object and calculates the semantic distances between words using a synonym dictionary. Then, through density clustering of the words, it obtains topic-related classes. Finally, it takes the central words selected from the topic-related classes as keywords. A statistical experiment and a scoring experiment show that the semantic-based keyword extraction method achieves relatively high precision and recall, and that the extracted keywords are highly relevant to the topic.
English keywords semantic distance; density clustering; keyword extraction
Received 2013/12/3
Revised 2014/1/21
Pages 142-145
CLC number TP391.43
Document code A