《计算机应用研究》|Application Research of Computers

一种面向网络话题发现的增量文本聚类算法

Incremental algorithm for clustering texts in internet-oriented topic detection

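Authors 殷风景, 肖卫东, 葛斌, 李芳芳
Affiliation C4ISR Technology National Defense Science & Technology Key Lab, National University of Defense Technology, Changsha 410073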
Article ID 1001-3695(2011)01-0054-04
DOI 10.3969/j.issn.1001-3695.2011.01.013
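Abstract To meet the needs of topic detection in internet public-opinion monitoring systems, and to overcome the two main shortcomings of the classic single-pass algorithm when clustering internet texts (sensitivity to input order and low precision), this paper proposes the ICIT algorithm. ICIT inherits the simple principle of single-pass, which keeps internet text clustering real-time. It improves clustering quality in five ways: tagging parts of speech during word segmentation of the body and keeping only nouns and verbs when vectorizing the body; building a title vector that represents each text jointly with its body vector; adopting an average-link strategy; introducing the concept of a "generation" so that texts are clustered in batches; and adding a step after each batch is clustered in which reports re-select and adjust the cluster they belong to. Experiments demonstrate the effectiveness and practicality of ICIT in improving the accuracy of topic detection.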
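The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how title and body similarity are combined, what threshold is used, or exactly how the re-selection step decides a move, so the equal title/body weighting, the `threshold` value, and every function name here are assumptions made for illustration. Tokens are assumed to have been segmented and filtered to nouns and verbs already.

```python
import math
from collections import Counter

def make_doc(title_tokens, body_tokens):
    """Represent a text by a title vector and a body vector (term counts)."""
    return {"title": Counter(title_tokens), "body": Counter(body_tokens)}

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def text_sim(x, y, title_weight=0.5):
    """Mix title and body similarity; the 0.5 weight is an assumption."""
    return (title_weight * cosine(x["title"], y["title"])
            + (1 - title_weight) * cosine(x["body"], y["body"]))

def average_link(doc, cluster):
    """Mean similarity between doc and every other member of the cluster."""
    members = [m for m in cluster if m is not doc]
    if not members:
        return 0.0
    return sum(text_sim(doc, m) for m in members) / len(members)

def best_cluster(doc, clusters):
    """Return the cluster with the highest average-link score, and the score."""
    best, best_score = None, 0.0
    for c in clusters:
        s = average_link(doc, c)
        if s > best_score:
            best, best_score = c, s
    return best, best_score

def cluster_generation(batch, clusters, threshold=0.3):
    """Cluster one batch ("generation"), then let its texts re-select."""
    # Single-pass step: join the best cluster or open a new one.
    for doc in batch:
        target, score = best_cluster(doc, clusters)
        if target is not None and score >= threshold:
            target.append(doc)
        else:
            clusters.append([doc])
    # Re-selection step (assumed rule): move a text if another cluster
    # now fits it best and still clears the threshold.
    for doc in batch:
        current = next(c for c in clusters if any(m is doc for m in c))
        target, score = best_cluster(doc, clusters)
        if target is not None and target is not current and score >= threshold:
            current[:] = [m for m in current if m is not doc]
            target.append(doc)
    clusters[:] = [c for c in clusters if c]  # drop clusters left empty
    return clusters

# Two generations of toy reports about two topics.
d1 = make_doc(["flood", "city"], ["flood", "rain", "river", "city"])
d2 = make_doc(["flood", "rescue"], ["flood", "river", "rescue", "city"])
d3 = make_doc(["match", "final"], ["team", "match", "goal", "final"])
d4 = make_doc(["team", "win"], ["team", "goal", "match", "win"])

clusters = []
cluster_generation([d1, d3], clusters)
cluster_generation([d2, d4], clusters)
```

With these toy inputs the flood reports end up in one cluster and the match reports in another. Scoring a text against the mean similarity to all members, rather than to a single representative, is what makes the comparison average-link; the second loop gives texts that arrived early in a batch a chance to move once the rest of the batch has been placed, which is one way to reduce dependence on input order.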
Keywords topic detection; text clustering; incremental clustering; accuracy; ICIT algorithm
Funding National Natural Science Foundation of China (60903225)
Article URL http://www.arocmag.com/article/1001-3695(2011)01-0054-04.html
English title Incremental algorithm for clustering texts in internet-oriented topic detection
Author names (English) YIN Feng-jing, XIAO Wei-dong, GE Bin, LI Fang-fang
Affiliation (English) C4ISR Technology National Defense Science & Technology Key Lab, National University of Defense Technology, Changsha 410073, China
English abstract To meet the needs of topic detection in monitoring public opinion on the internet, this paper proposes an incremental clustering algorithm called ICIT that addresses the two main disadvantages of the single-pass algorithm: sensitivity to the order of inputs and low precision. ICIT inherits the simple principle of single-pass, which allows internet texts to be clustered in real time, and overcomes its shortcomings by selecting only nouns and verbs from the content to build the content vector, representing each text by a title vector together with the content vector, adopting an average-link comparison strategy, introducing generations to process texts in batches, and adding a stage after each batch is clustered in which texts reconsider and adjust their cluster assignment. Experiments demonstrate ICIT's validity and practicability in improving the precision of topic detection.
English keywords topic detection; text clustering; incremental clustering; precision; ICIT algorithm
Pages 54-57
Document code A