《计算机应用研究》|Application Research of Computers

Microblog topics detection based on SOM clustering

Authors: Song Lina (宋莉娜), Feng Xupeng (冯旭鹏), Liu Lijun (刘利军), Huang Qingsong (黄青松)
Affiliation: Kunming University of Science & Technology (a. Faculty of Information Engineering & Automation; b. Educational Technology & Network Center; c. Yunnan Provincial Key Laboratory of Computer Technology Applications), Kunming 650500, China
Article ID: 1001-3695(2018)03-0671-04
DOI: 10.3969/j.issn.1001-3695.2018.03.007
Funding: National Natural Science Foundation of China (81360230, 81560296)
Article URL: http://www.arocmag.com/article/01-2018-03-007.html
Abstract: As the number of microblog users grows, information on microblog platforms is updated ever more frequently. To address the data sparseness, the abundance of new words, and the non-standard language of microblog text, this paper proposes a microblog topic detection method based on SOM clustering. The texts in the raw corpus are first preprocessed, and features are then extracted from the short texts with a word vector model, which reduces the heavy computation caused by excessively high vector dimensionality. An improved SOM then clusters the texts into topics; the algorithm remedies shortcomings of traditional text clustering and can therefore detect topics effectively. Experiments show that the algorithm clearly improves the comprehensive F-measure over traditional text clustering algorithms.
Keywords: topic detection; word vector model; text similarity; short text; SOM clustering
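The abstract does not describe the improved SOM in detail, so the following is only a minimal sketch of the general pipeline it outlines: word vectors, then short-text features, then SOM clustering. It uses a plain textbook SOM, not the authors' improved variant. The toy two-dimensional word vectors, the choice to average word vectors into a text vector, the deterministic initialisation, the function names, and every hyperparameter are assumptions made for illustration.

```python
import math


def text_vector(tokens, word_vectors):
    """Represent a short text as the mean of its known word vectors (assumed scheme)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("text has no in-vocabulary tokens")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j], x))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=0.5):
    """Train a rows x cols SOM with linearly decaying learning rate and radius."""
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    n_units = len(grid)
    # Simplification: initialise units from evenly spaced samples, no randomness.
    weights = [list(data[k * len(data) // n_units]) for k in range(n_units)]
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:  # samples are presented in a fixed order
            decay = 1.0 - step / total
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 0.05)
            bmu = best_matching_unit(weights, x)
            for j, pos in enumerate(grid):
                # Gaussian neighbourhood, measured on the map grid.
                h = math.exp(-sq_dist(pos, grid[bmu]) / (2.0 * sigma ** 2))
                weights[j] = [w + lr * h * (xi - w) for w, xi in zip(weights[j], x)]
            step += 1
    return weights


# Hypothetical toy embeddings; a real system would load trained word vectors.
word_vectors = {
    "earthquake": [1.0, 0.0], "rescue": [0.9, 0.1],
    "concert": [0.0, 1.0], "singer": [0.1, 0.9],
}
texts = [
    ["earthquake", "rescue"],
    ["rescue", "update"],    # "update" has no vector and is skipped
    ["concert", "singer"],
    ["singer", "tickets"],   # "tickets" has no vector and is skipped
]
vectors = [text_vector(t, word_vectors) for t in texts]
weights = train_som(vectors, rows=1, cols=2)
topics = [best_matching_unit(weights, v) for v in vectors]
print(topics)  # [0, 0, 1, 1]
```

Each map unit acts as one candidate topic, and a text is assigned to the unit nearest its vector. Averaging keeps every text vector at the word-vector dimensionality regardless of vocabulary size, which is one way to read the abstract's point about avoiding very high-dimensional representations.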
Received: 2016/11/16
Revised: 2017/1/5
Pages: 671-674, 679
CLC number: TP391
Document code: A