《计算机应用研究》|Application Research of Computers

面向基于关键词的相似性搜索的嵌入方法有效性分析

Effectiveness of embedding methods for keyword-based similarity search

免费全文下载 (已被下载 次)  
获取PDF全文
作者 王梦红,王骞
机构 武汉大学 计算机学院,武汉 430072
统计 摘要被查看 次,已被下载
文章编号 1001-3695(2017)09-2659-07
DOI 10.3969/j.issn.1001-3695.2017.09.021
摘要 FastMap、SparseMap、BoostMap被认为是适用于任何度量空间的嵌入方法。然而之前的研究者高估了它们的适用性,它们在基于关键词的度量空间中并不适用。为了评估它们在关键词空间中的适用性,通过将它们实例化到基于关键词的相似性搜索的场景中,利用嵌入方法与局部敏感哈希相结合的方法,针对它们的嵌入效果进行了研究。重点从精确度、召回率、应力(stress)和距离保存效率方面,给出了它们在不同数据集上的实验结果。发现它们在基于关键词的度量空间中的嵌入效果并不好,得出了它们并不适用于所有的度量空间的结论,并分析了其效果不好的原因。
关键词 嵌入方法;关键词空间;相似性搜索;FastMap;SparseMap;BoostMap
基金项目 国家自然科学基金资助项目(61373167)
国家重点基础研究发展计划资助项目(2014CB340600)
本文URL http://www.arocmag.com/article/01-2017-09-021.html
英文标题 Effectiveness of embedding methods for keyword-based similarity search
作者英文名 Wang Menghong, Wang Qian
机构英文名 CollegeofComputer,WuhanUniversity,Wuhan430072,China
英文摘要 Some researchers consider FastMap, SparseMap and BoostMap are applicable in any metric spaces. However, they overestimated their practicability, and found that they were not applicable in the metric spaces based on keywords. To evaluate their practicability in the keyword space, this paper instantiated them into the scenario of keyword-based similarity search, and used a framework consisting of embedding methods and locality sensitive hash(LSH), and evaluated them in terms of precision, recall, stress and effectiveness in distance preservation. In addition, it provided some analysis of them. The experiments show that their embedding effectiveness in the keyword space are not good. Thus, this paper concludes that these embedding methods are not applicable for all metric spaces.
英文关键词 embedding methods; keyword space; similarity search; FastMap; SparseMap; BoostMap
参考文献 查看稿件参考文献
  [1] Faloutsos C, Lin K I. FastMap:a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets[J] . ACM SIGMOD Record, 1995, 24(2):163-174.
[2] Kuzu M, Islam M S, Kantarcioglu M. Efficient similarity search over encrypted data[C] //Proc of the 28th IEEE International Conference on Data Engineering. 2012:1156-1167.
[3] Hjaltason G R, Samet H. Properties of embedding methods for similarity searching in metric spaces[J] . IEEE Trans on Pattern Analysis and Machine Intelligence, 2003, 25(5):530-549.
[4] Gabriela H, Martin F. Cluster-preserving embedding of proteins[R] . Piscataway, NJ:Rutgers University, 1999:621-624.
[5] Athitsos V, Alon J, Sclaroff S, et al. BoostMap:an embedding method for efficient nearest neighbor retrieval[J] . IEEE Trans on Pattern Analysis and Machine Intelligence, 2008, 30(1):89-104.
[6] Kruskal J B, Wish M. Multidimensional scaling[R] . Calif:Sage University, 1978.
[7] Indyk P, Motwani R. Approximate nearest neighbors:towards removing the curse of dimensionality[C] //Proc of the 30th ACM Symposium on Theory of Computing. New York:ACM Press, 1998:604-613.
[8] Rajaraman A, Ullman J D. Mining of massive datasets[M] . [S. l. ] :Cambridge University Press, 2012.
[9] Bourgain J. On Lipschitz embedding of finite metric spaces in Hilbert space[J] . Israel Journal of Mathematics, 1985, 52(1):46-52.
[10] Johnson W B, Lindenstrauss J. Extensions of Lipschitz maps into a Hilbert space[J] . Contemporary Mathematics, 1984, 26:189-206.
[11] Schapire R E, Singer Y. Improved boosting algorithms using confidence-rated predictions[J] . Machine Learning, 1999, 37(3):297-336.
[12] Enron email dataset[EB/OL] . (2015). http://www. cs. cmu. edu/~enron/.
[13] Typo generator[EB/OL] . (2011). https://dbappserv. cis. upenn. edu/spell/.
收稿日期 2016/7/1
修回日期 2016/8/29
页码 2659-2665
中图分类号 TP391
文献标志码 A