《计算机应用研究》|Application Research of Computers

Research of parallel feature matching algorithm based on Hadoop (基于Hadoop平台的并行特征匹配算法研究)

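Authors 李宝禄 (LI Bao-lu), 张伟 (ZHANG Wei)
Affiliation Beijing Information Science & Technology University (北京信息科技大学): a. School of Computer Science; b. Beijing Key Laboratory of Internet Culture & Digital Dissemination Research; Beijing 100101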
Article ID 1001-3695(2014)11-3320-04
DOI 10.3969/j.issn.1001-3695.2014.11.027
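Abstract Many large enterprises store massive data in the Hadoop distributed file system, whereas traditional virus scanning mainly targets single-machine environments. This paper studies how to parallelize feature matching, the core algorithm of virus scanning, so that it can process massive distributed data. On the Hadoop platform, efficient virus scanning of big data is implemented with the MapReduce parallel programming model. In particular, to address Hadoop's low efficiency in processing massive numbers of small files, the small files are merged and an index is then used to improve processing efficiency. Experimental results show that the proposed parallel feature matching algorithm significantly reduces processing time and is suitable for virus scanning of big data.
Keywords distributed file system; big data; feature matching; parallel scanning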
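Funding Beijing Excellent Talents Training Program (2012D005007000009)
Open Project of the Beijing Key Laboratory of Internet Culture & Digital Dissemination Research, Beijing Information Science & Technology University (ICDD201306)
Innovation Team Building and Teacher Career Development Program for Beijing Municipal Universities (IDHT20130519)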
Article URL http://www.arocmag.com/article/01-2014-11-027.html
Authors (English) LI Bao-lu, ZHANG Wei
Affiliation (English) a. School of Computer Science, b. Beijing Key Laboratory of Internet Culture & Digital Dissemination Research, Beijing Information Science & Technology University, Beijing 100101, China
English abstract Many enterprises use the Hadoop distributed file system to store massive data, but traditional virus scanning mainly targets a single machine. This paper studies how to parallelize the core feature matching algorithm of virus scanning so that it can handle massive distributed data. Using the MapReduce framework on the Hadoop platform, it implements efficient virus scanning of big data. In particular, to address the low efficiency of processing massive numbers of small files on Hadoop, it merges the small files and then uses an index to improve the efficiency of scanning them. The experimental results show that the parallel feature matching algorithm reduces processing time significantly and is applicable to virus scanning of big data.
English keywords distributed file system; big data; feature matching; parallel scanning
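The abstract above outlines two ideas: merging small files and locating them through an index, and running feature matching as map and reduce steps. This page does not show the paper's implementation, so the following Python sketch is only a rough illustration under stated assumptions. It runs in memory without Hadoop, uses plain substring search as a stand-in for the paper's feature matching algorithm, and every name in it (`SIGNATURES`, `merge_small_files`, `map_scan`, `reduce_matches`, `scan_merged`) is hypothetical.

```python
from collections import defaultdict

# Hypothetical byte signatures; a real scanner would load these from a virus database.
SIGNATURES = {
    "Demo.TestA": b"\xde\xad\xbe\xef",
    "Demo.TestB": b"EVIL_MARKER",
}


def merge_small_files(files):
    """Concatenate small files into one blob and record (offset, length) per file."""
    blob = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def map_scan(name, data):
    """Map step: emit (file name, signature name) for every signature found in data."""
    for sig_name, pattern in SIGNATURES.items():
        if pattern in data:  # plain substring search, not the paper's algorithm
            yield name, sig_name


def reduce_matches(pairs):
    """Reduce step: group the matched signature names by file name."""
    report = defaultdict(list)
    for name, sig_name in pairs:
        report[name].append(sig_name)
    return {name: sorted(sigs) for name, sigs in report.items()}


def scan_merged(blob, index):
    """Use the index to slice each original file out of the blob, then scan it."""
    pairs = []
    for name, (offset, length) in index.items():
        pairs.extend(map_scan(name, blob[offset:offset + length]))
    return reduce_matches(pairs)


# Made-up sample data for illustration only.
files = {
    "a.txt": b"hello world",
    "b.bin": b"xx\xde\xad\xbe\xefyy",
    "c.log": b"EVIL_MARKER and \xde\xad\xbe\xef",
}
blob, index = merge_small_files(files)
report = scan_merged(blob, index)
print(report)
```

Slicing by the index before matching keeps each file's bytes separate, so a pattern that happens to straddle the boundary between two merged files is not reported as a match.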
Received 2013/11/28
Revised 2014/1/2
Pages 3320-3323
CLC number TP301.6
Document code A