《计算机应用研究》|Application Research of Computers

Centroid-word based context topic model

Authors Chang Dongya (常东亚), Yan Jianfeng (严建峰), Yang Lu (杨璐)
Affiliation School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
Article ID 1001-3695(2018)04-1005-05
DOI 10.3969/j.issn.1001-3695.2018.04.010
Abstract The latent Dirichlet allocation (LDA) topic model is an effective tool for processing unstructured documents. However, it is built on the bag-of-words (BOW) assumption, which treats each document as an unordered collection of words and ignores both the order among documents and the order among words. To improve the accuracy of existing models, this paper proposes a centroid-word based context topic model. Its underlying idea is that the topic of a word in a document is more closely related to the topics of the words near it. When computing the topic distribution of each word, the model takes that word as the center, extends several words before and after it to form a window, and then performs the computation on each window. The windows thus acquire an order, which makes the words locally ordered as well. Moreover, since each word has a different context, its topic distribution depends on its position in the document. Experiments show that the centroid-word based context topic model achieves higher accuracy and faster convergence on unseen datasets.
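The windowing idea described in the abstract can be sketched as follows. This is a minimal Python illustration, not the paper's actual inference procedure: the function names, the default window half-width, and the smoothed window-count estimate are assumptions made for illustration; the paper's full model would also draw on corpus-wide word-topic statistics.

```python
from collections import Counter

def context_windows(doc, w):
    """For each position i in doc, build the window consisting of the
    center word doc[i] plus up to w words on each side."""
    return [doc[max(0, i - w): i + w + 1] for i in range(len(doc))]

def window_topic_distribution(i, doc, z, K, alpha, w=2):
    """Toy estimate of the topic distribution of the word at position i:
    smoothed counts of the current topic assignments z of the other
    words in its window. Only illustrates the locality idea; real
    inference also uses word-topic and document-topic statistics."""
    window = range(max(0, i - w), min(len(doc), i + w + 1))
    counts = Counter(z[j] for j in window if j != i)
    total = sum(counts.values()) + K * alpha
    return [(counts.get(k, 0) + alpha) / total for k in range(K)]

doc = ["topic", "model", "lda", "window", "context", "word"]
print(context_windows(doc, 2)[3])  # ['model', 'lda', 'window', 'context', 'word']
```

Because each position gets its own window, two occurrences of the same word at different positions see different contexts, which is exactly why the abstract says a word's topic distribution depends on its location in the document.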
Keywords latent Dirichlet allocation; topic model; context information
Funding National Natural Science Foundation of China (61373092, 61572339, 61272449); Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005)
URL http://www.arocmag.com/article/01-2018-04-010.html
Received 2016-12-15
Revised 2017-02-20
Pages 1005-1009
CLC number TP391
Document code A