《计算机应用研究》|Application Research of Computers

Study of distributional representation of Chinese words (基于分布的中文词表示研究)

Authors Cao Xuefei (曹学飞), Li Jihong (李济洪), Wang Ruibo (王瑞波)
Affiliation School of Software Engineering, Shanxi University, Taiyuan 030006, China
Article ID 1001-3695(2019)03-009-0687-04
DOI 10.19734/j.issn.1001-3695.2017.09.0917
Funding Supported by the National Social Science Foundation of China (16BTJ034)
Article URL http://www.arocmag.com/article/01-2019-03-009.html
Abstract This paper presents a systematic study of parameter selection in the construction of distributional representations of Chinese words. Six kinds of parameters are compared experimentally, and the quality of the representations obtained under each parameter setting is evaluated on a Chinese semantic similarity task. The results show that, with appropriately chosen parameters, distributional representations achieve high performance on the Chinese semantic similarity task, and that these high-dimensional representations even outperform the currently popular low-dimensional word representations obtained with neural networks (Skip-gram) or matrix factorization (GloVe).
Keywords distributional representation; semantic similarity; pointwise mutual information
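The keywords name pointwise mutual information and semantic similarity, so a minimal sketch of that general pipeline may help readers unfamiliar with count-based word representations. It builds sparse positive PMI (PPMI) vectors from co-occurrence counts and compares two words by cosine similarity. Everything specific here is an assumption made for illustration: the toy pre-segmented corpus, the symmetric window of 2, the PPMI weighting, and the function names are hypothetical and are not the settings or code evaluated in the paper.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (target, context) pairs inside a symmetric window (assumed size)."""
    pairs = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(target, tokens[j])] += 1
    return pairs


def ppmi_vectors(pairs):
    """Convert pair counts into sparse PPMI vectors: {word: {context: weight}}."""
    total = sum(pairs.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pairs.items():
        word_totals[word] += n
        context_totals[context] += n
    vectors = {}
    for (word, context), n in pairs.items():
        # PMI = log2( P(w, c) / (P(w) * P(c)) ), estimated from raw counts.
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        if pmi > 0:  # keep only positive associations (PPMI)
            vectors.setdefault(word, {})[context] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[c] for c, weight in u.items() if c in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical, already word-segmented toy corpus (illustration only).
corpus = [
    ["猫", "喜欢", "吃", "鱼"],      # cat / likes / eat / fish
    ["狗", "喜欢", "吃", "肉"],      # dog / likes / eat / meat
    ["猫", "和", "狗", "是", "宠物"],  # cat / and / dog / are / pets
]

vectors = ppmi_vectors(cooccurrence_counts(corpus, window=2))
similarity = cosine(vectors.get("猫", {}), vectors.get("狗", {}))
print(f"cosine(猫, 狗) = {similarity:.3f}")
```

Each vector has one dimension per context word, which is why such representations are high-dimensional and sparse compared with Skip-gram or GloVe embeddings. On a real corpus, the similarity scores would be compared against human judgments to evaluate a given parameter setting.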
Received 2017-09-12
Revised 2017-11-17
Pages 687-690
CLC number TP391
Document code A