《计算机应用研究》|Application Research of Computers

An access optimization method for massive small files in HDFS (一种面向HDFS中海量小文件的存取优化方法)

Optimization of massive small files storage and accessing on HDFS

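Authors Gu Yuwan (顾玉宛), Wang Wenwen (王文闻), Sun Yuqiang (孙玉强)
Affiliation School of Information Science & Engineering, Changzhou University, Changzhou, Jiangsu 213164, China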
Article ID 1001-3695(2017)08-2319-05
DOI 10.3969/j.issn.1001-3695.2017.08.019
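Abstract To address problems such as the NameNode memory bottleneck that HDFS (Hadoop distributed file system) encounters when storing massive numbers of small files, and to improve the efficiency with which HDFS handles them, this paper proposes a storage and access optimization scheme based on small-file merging and prefetching. First, a large volume of historical access logs is analyzed to obtain the correlations among small files; correlated small files are then merged into large files before being stored in HDFS. When data is read from HDFS, the files the user is most likely to access next are prefetched according to the correlations among files, which reduces the number of client requests to the NameNode and improves the file hit rate and processing speed. Experimental results show that the method effectively improves the efficiency with which Hadoop stores and accesses small files and lowers the memory usage of the NameNode.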
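The abstract does not give the correlation measure or the merge layout, so the following is only a minimal in-memory sketch of the two storage-side steps under stated assumptions. It assumes the access log has already been split into per-user sessions, treats two files as correlated when they co-occur in at least `min_count` sessions, and groups them with a union-find. The names `co_access_counts`, `group_correlated`, `merge_files`, and `min_count` are hypothetical, and nothing here calls a real HDFS API.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_access_counts(sessions):
    """Count how often each unordered pair of files shares a session."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts


def group_correlated(sessions, min_count=2):
    """Group files linked by co-access counts >= min_count (assumed rule)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for session in sessions:
        for name in session:
            find(name)  # register every file, even uncorrelated ones
    for (a, b), n in co_access_counts(sessions).items():
        if n >= min_count:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in list(parent):
        groups[find(name)].append(name)
    return sorted(sorted(g) for g in groups.values())


def merge_files(contents):
    """Concatenate small files into one blob plus a name -> (offset, length) index."""
    blob = bytearray()
    index = {}
    for name in sorted(contents):
        data = contents[name]
        index[name] = (len(blob), len(data))
        blob += data
    return bytes(blob), index


# Hypothetical sessions extracted from an access log.
sessions = [["a", "b"], ["a", "b", "c"], ["c", "d"], ["c", "d"], ["e"]]
print(group_correlated(sessions))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Each group would then be written as one large file, so the NameNode tracks one entry per group rather than one per small file; the index lets a reader slice an individual file back out of the merged blob.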
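Keywords massive small files; file correlation; merging; prefetching
Funding National Natural Science Foundation of China (11271057, 61640211)
Jiangsu Province Graduate Research and Innovation Program for Regular Higher Education Institutions (SCZ1412800004)
Article URL http://www.arocmag.com/article/01-2017-08-019.html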
English title Optimization of massive small files storage and accessing on HDFS
Authors (English) Gu Yuwan, Wang Wenwen, Sun Yuqiang
Affiliation (English) School of Information Science & Engineering, Changzhou University, Changzhou Jiangsu 213164, China
English abstract To relieve the NameNode memory bottleneck that arises when HDFS stores massive numbers of small files, and to improve the efficiency of HDFS, this paper proposes a storage and access optimization scheme for massive small files on HDFS. First, the relationships among small files are obtained by analyzing a large number of historical access logs, and correlated small files are then merged into a large file that is stored on HDFS. When a client reads data from HDFS, the system prefetches the related files most likely to be accessed next, according to the correlation among small files; this reduces the number of requests to the NameNode and thereby raises the hit rate and processing speed. Experimental results show that the method effectively improves the efficiency of storing and accessing massive small files on HDFS and reduces the memory utilization of the NameNode.
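The read-side idea in the abstract, prefetching the files most likely to be accessed next, can be illustrated with a small client-side cache. This is a hedged sketch and not the paper's implementation: `PrefetchingReader`, its `fetch` callable (a stand-in for a remote read), the `related` map, and the LRU eviction policy are all assumptions introduced for illustration. It also fetches related files one at a time, so it does not model batching several files into a single request.

```python
from collections import OrderedDict


class PrefetchingReader:
    """LRU cache that also loads the files correlated with each read (assumed policy)."""

    def __init__(self, fetch, related, capacity=4):
        self.fetch = fetch        # callable: name -> bytes (stand-in for a remote read)
        self.related = related    # dict: name -> list of correlated file names
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _put(self, name, data):
        self.cache[name] = data
        self.cache.move_to_end(name)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def read(self, name):
        if name in self.cache:
            self.hits += 1
            self.cache.move_to_end(name)
            data = self.cache[name]
        else:
            self.misses += 1
            data = self.fetch(name)
            self._put(name, data)
        # Prefetch correlated files that are not cached yet.
        for other in self.related.get(name, []):
            if other not in self.cache:
                self._put(other, self.fetch(other))
        return data


# Hypothetical backing store and correlation map.
store = {"a": b"1", "b": b"2", "c": b"3"}
reader = PrefetchingReader(store.__getitem__, {"a": ["b"]}, capacity=2)
reader.read("a")   # miss; "b" is prefetched alongside it
reader.read("b")   # served from the cache
print(reader.hits, reader.misses)  # 1 1
```

In this toy run the second read never reaches the backing store, which is the effect the abstract attributes to prefetching: fewer round trips for metadata and data when accesses follow the learned correlations.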
English keywords massive small files; relationship between files; merge; prefetch
Received 2016/8/19
Revised 2016/9/26
Pages 2319-2323
CLC number TP391
Document code A