《计算机应用研究》|Application Research of Computers

Similar index: two-level index used for deduplication

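Authors ZHANG Zhi-ke (张志珂), JIANG Ze-jun (蒋泽军), CAI Xiao-bin (蔡小斌), PENG Cheng-zhang (彭成章)
Affiliation School of Computer, Northwestern Polytechnical University (西北工业大学计算机学院), Xi'an 710072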
Article ID 1001-3695(2013)12-3614-04
DOI 10.3969/j.issn.1001-3695.2013.12.025
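Abstract Because EB (extreme binning) uses the minimum chunk signature of a file as the file's feature, it is poorly suited to workloads that consist mainly of small files, and it yields a poor deduplication ratio on them. To improve on EB, this paper proposes the similar index. It uses a similarity hash as the file's feature and is a two-level index suited to deduplicating workloads dominated by small files. Experimental results show that the deduplication ratio of the similar index is 24.8% higher than that of EB, and that its memory usage is only 0.265% of EB's. Compared with EB, the similar index requires less storage and less memory.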
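Keywords deduplication; similarity hash; similar index; chunk-lookup disk bottleneck problem; two-level index
Funding Natural Science Foundation of Shaanxi Province (2010JM8023)
Aeronautical Science Foundation (2010ZD53042)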
Article URL http://www.arocmag.com/article/01-2013-12-025.html
English title Similar index: two-level index used for deduplication
Authors (English) ZHANG Zhi-ke, JIANG Ze-jun, CAI Xiao-bin, PENG Cheng-zhang
Affiliation (English) School of Computer, Northwestern Polytechnical University, Xi'an 710072, China
English abstract Because EB (extreme binning) uses the minimum chunk ID of a file as its representative chunk signature, EB is not suitable for backup data streams that mainly contain small files. To improve on EB, this paper proposes simi index, which uses the simi hash as the feature of a file. It is a novel two-level index suited to workloads consisting mainly of small files. Experimental results show that the deduplication efficiency of simi index is 24.8% better than that of EB, and that the RAM usage of simi index is only 0.265% of that of EB. Compared with EB, simi index needs less storage and less RAM.
English keywords deduplication; simi hash; similar index; chunk-lookup disk bottleneck problem; two-level index
Pages 3614-3617
CLC number TP301.6
Document code A
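
The abstract says that EB takes the minimum chunk ID of a file as that file's feature, and that this works poorly for small files. The sketch below is only an illustration of that observation, not the paper's implementation. Its assumptions are fixed-size 4096-byte chunks, SHA-1 chunk IDs, and the hypothetical helper names `chunk_ids`, `representative_chunk_id`, and `same_bin`. It shows that a one-chunk file changes its feature after any edit, whereas a many-chunk file usually keeps it.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks, chosen only to keep the sketch short


def chunk_ids(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return the SHA-1 hex digest of every fixed-size chunk of data."""
    if not data:
        return [hashlib.sha1(b"").hexdigest()]
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def representative_chunk_id(data: bytes, chunk_size: int = CHUNK_SIZE) -> str:
    """Return the minimum chunk ID, the per-file feature the abstract attributes to EB."""
    return min(chunk_ids(data, chunk_size))


def same_bin(a: bytes, b: bytes) -> bool:
    """In this sketch, two files are matched only when their features are equal."""
    return representative_chunk_id(a) == representative_chunk_id(b)


# A small file fits in one chunk, so its only chunk ID is its feature.
small_v1 = b"configuration: retries=3\n"
small_v2 = b"configuration: retries=4\n"
print(same_bin(small_v1, small_v2))  # False: any edit changes the sole chunk ID

# A larger file keeps its feature unless the edit touches the minimum chunk
# or happens to produce a new chunk with an even smaller ID.
large_v1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(64))
ids = chunk_ids(large_v1)
edit_at = (ids.index(min(ids)) + 1) % len(ids)  # a chunk that is not the minimum
large_v2 = bytearray(large_v1)
large_v2[edit_at * CHUNK_SIZE] ^= 0xFF  # flip the first byte of that chunk
print(same_bin(large_v1, bytes(large_v2)))
```

With only one chunk per file, two nearly identical small files almost never share a minimum chunk ID, which is consistent with the abstract's motivation for using a similarity hash as the file feature instead.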