《计算机应用研究》|Application Research of Computers

基于MapReduce的增量式数据集的相似性连接

MapReduce-based similarity join for incremental data set

免费全文下载 (已被下载 次)  
获取PDF全文
作者 徐媛媛,陈华辉
机构 宁波大学 信息科学与工程学院,浙江 宁波 315211
统计 摘要被查看 次,已被下载
文章编号 1001-3695(2014)11-3369-06
DOI 10.3969/j.issn.1001-3695.2014.11.039
摘要 相似性连接,即利用相似函数度量数据之间的相似程度,满足条件后进行连接操作。MapReduce框架下已存在很多相似性连接算法,但仍然存在一些不足,如大量的索引加大时间、空间的开销;现有算法不能有效地完成增量式数据集的相似性连接等。针对海量增量式数据集进行了研究,采用抽样技术得到有效中枢,形成更为合理的分区,建立分区索引和分配原则,完成新增数据的相似性连接操作。实验证明,该算法能够有效地解决海量增量式数据集的相似性连接问题,验证了分区索引的建立,可以提高新增数据的相似性连接操作的效率。
关键词 海量增量式数据集;划分;相似性连接;MapReduce
基金项目 浙江省公益性技术应用研究计划资助项目(2011C21076)
本文URL http://www.arocmag.com/article/01-2014-11-039.html
英文标题 MapReduce-based similarity join for incremental data set
作者英文名 XU Yuan-yuan, CHEN Hua-hui
机构英文名 College of Information Science & Engineering, Ningbo University, Ningbo Zhejiang 315211, China
英文摘要 Similarity join was namely that using similar function to measure the similarity level of the data set, and then doing the join after meeting the condition. Many effective similarity join algorithms had been in mapreduce, but there were still some insufficiency, such as a lot of indexes increases the overhead of time and space;the existing algorithm couldn’t deal with the similarity computation of the incremental data set effectively, and so on. For massive incremental data set, this paper made use of sampling to get the valid pivots, which established partitions’ indexes and distribution principle, then finished the similarity join operation of additional data. The experiments prove that the algorithm can solve the problem of the similarity join of the incremental data set effectively, and verify that through creating partitions’ indexes, it can improve the efficiency of the similarity join operation of additional data.
英文关键词 massive incremental data set; partition; similarity join; MapReduce
参考文献 查看稿件参考文献
  [1] CHAUDHURI S, GANTI V, KAUSHIK R. A primitive operator for similarity joins in data cleaning[C] //Proc of the 22nd International Conference on Data Engineering. Atlanta, GA:[s. n. ] , 2006:5.
[2] XIAO Chuan, WANG Wei, LIN Xue-min, et al. Efficient similarity joins for near-duplicate detection[J] . ACM Trans on Database Systems, 2011, 36(3):15-24.
[3] SPERTUS E, SAHAMI M, BUYUKKOKTEN O. Evaluating similarity measures:a large-scale study in the orkut social network[C] //Proc of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. New York:ACM Press, 2005:678-684.
[4] BAYARDO R J, MA Yi-ming, SRIKANT R. Scaling up all pairs similarity search[C] // Proc of the 16th International Conference on World Wide Web. 2007:131-140.
[5] XIAO Chuan, WANG Wei, LIN Xue-min. Ed-join:an efficient algorithm for similarity joins with edit distance constraints[J] . Procee-dings of the VLDB Endowment, 2008, 1(1):933-944.
[6] WANG Jian-nan, FENG Jian-hua, LI Guo-liang. Trie-join:efficient trie-based string similarity joins with edit-distance constraints[J] . Proceedings of the VLDB Endowment, 2010, 3(1-2):1219-1230.
[7] ZHANG Zhen-jie, HADJIELEFTHERIOU M, OOI B C, et al. Bed-tree:an all-purpose index structure for string similarity search based on edit distance[C] //Proc of ACM SIGMOD International Conference on Management of Data. New York:ACM Press, 2010:915-926.
[8] VERNICA R, CAREY M J, LI Chen. Efficient parallel set-similarity joins using MapReduce[C] //Proc of ACM SIGMOD International Conference on Management of Data. New York:ACM Press, 2010:495-506.
[9] METWALLY A, FALOUTSOS C. V-smart-join:a scalable mapreduce framework for all-pair similarity joins of multisets and vectors[J] . Proceedings of the VLDB Endowment, 2012, 5(8):704-715.
[10] 荣垂田, 徐天任, 杜小勇. 基于划分的集合相似连接[J] . 计算机研究与发展, 2012, 49(10):2066-2076.
[11] 洪银杰, 陈刚, 陈珂. 基于分区索引的集合相似连接[J] . 浙江大学学报:工学版, 2012, 46(2):286-293.
[12] JACOX E H, SAMET H. Metric space similarity joins[J] . ACM Trans on Database Systems, 2008, 33(2):1-38. [13] ELSAYED T, LIN J, OARD D W. Pairwise document similarity in large collections with MapReduce[C] //Proc of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies. Cambridge, MA:Computational Linguistics, 2008:265-268.
[14] BARAGLIA R, De FRANCISCI M G, LUCCHESE C. Document similarity self-join with MapReduce[C] //Proc of the 10th IEEE International Conference on Data Mining. Sydney:IEEE Computer Society, 2010:731-736.
[15] KIM Y, SHIM K. Parallel top-K similarity join algorithms using Map-Reduce[C] //Proc of the 28th IEEE International Conference on Data Engineering. Washington DC:IEEE Computer Society, 2012:510-521.
[16] SEIDL T, FRIES S, BODEN B. MR-DSJ:distance-based self-join for large-scale vector data analysis with MapReduce[C] //Proc of the 15th Conference on Database System Technology and Web. Magdeburg:BTW, 2013:37-56.
[17] SILVA Y N, REED J M. Exploiting MapReduce-based similarity joins[C] //Proc of ACM SIGMOD International Conference on Mana-gement of Data. Scottsdale, Arizona:ACM Press, 2012:693-696.
[18] HADJIELEFTHERIOU M, KOUDAS N, SRIVASTAVA D. Incremental maintenance of length normalized indexes for approximate string matching[C] //Proc of ACM SIGMOD International Conference on Management of Data. Providence, RI:ACM Press, 2009:429-440.
[19] FRANK A, ASUNCION A. UCI machine learning repository[EB/OL] . (2010). http://archive. ics. uci. edu/ml/.
[20] 庞俊, 谷欲, 许嘉, 等. 相似性连接查询技术研究进展[J] . 计算机科学与探索, 2013, 7(1):3-15.
收稿日期 2013/10/18
修回日期 2013/12/10
页码 3369-3374,3384
中图分类号 TP311.1
文献标志码 A