《计算机应用研究》|Application Research of Computers

Video classification based on cascaded encoding fusion of temporal and spatial deep features

Authors: Zhi Hongxin, Yu Hongtao, Li Shaomei
Affiliation: National Digital Switching System Engineering Technology R&D Center, Zhengzhou 450002, China
Article ID: 1001-3695(2018)03-0926-04
DOI: 10.3969/j.issn.1001-3695.2018.03.060
Abstract: To address the relatively low accuracy of existing deep features for video classification, this paper proposes a new video classification method based on two-level cascaded encoding and fusion of spatio-temporal deep features. First, two deep convolutional neural network models extract the deep spatial and deep temporal information of video frames, respectively. These spatio-temporal deep features are then encoded in two cascaded levels, first with Fisher vectors and then with local aggregation, to obtain an efficient representation of the video. Finally, a support vector machine classifies the videos using the jointly encoded, two-level spatio-temporal deep features. Experimental results on UCF101 show that the proposed algorithm achieves better classification accuracy than existing methods. (An illustrative sketch of this pipeline appears at the end of this page.)
Keywords: video classification; cascaded encoding; deep learning; feature fusion
Funding: National Natural Science Foundation of China (61521003, 61379151)
Science and Technology Support Program (2014BAH30B01)
Henan Province Fund for Distinguished Young Scholars (144100510001)
National Science Fund for Distinguished Young Scholars (61601513)
URL: http://www.arocmag.com/article/01-2018-03-060.html
Received: 2016/10/25
Revised: 2016/12/21
Pages: 926-929
CLC number: TP391.4
Document code: A
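
The method described in the abstract can be pictured as a small pipeline: extract per-frame deep features with a spatial CNN and a temporal (optical-flow) CNN, encode each stream in two cascaded levels (Fisher vectors, then local aggregation), fuse the two encodings, and classify with a support vector machine. The following Python sketch illustrates that order of operations under stated assumptions; the helper functions, segment length, codebook sizes, and feature dimensions are illustrative and not taken from the paper.

```python
# Minimal sketch of the two-level encoding and fusion pipeline summarised in the
# abstract, using scikit-learn for brevity. Per-frame deep features are assumed to
# be already extracted by a spatial CNN (RGB frames) and a temporal CNN (optical
# flow); all sizes below are illustrative assumptions, not the paper's settings.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC


def fisher_vector(x, gmm):
    """Level-1 encoding: improved Fisher vector of a set of local descriptors."""
    q = gmm.predict_proba(x)                                  # (T, K) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_   # diagonal covariances
    t = x.shape[0]
    diff = (x[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * diff).sum(0) / (t * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (t * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalisation


def aggregate_locally(codes, km):
    """Level-2 encoding: VLAD-style aggregation of the level-1 codes."""
    assigned = km.predict(codes)
    enc = np.zeros_like(km.cluster_centers_)
    for c in range(km.n_clusters):
        if np.any(assigned == c):
            enc[c] = (codes[assigned == c] - km.cluster_centers_[c]).sum(0)
    enc = np.sign(enc) * np.sqrt(np.abs(enc))
    return enc.ravel() / (np.linalg.norm(enc) + 1e-12)


def level1_codes(frames, gmm, seg_len=10):
    """Fisher-vector codes of consecutive, non-overlapping frame segments."""
    return np.stack([fisher_vector(frames[i:i + seg_len], gmm)
                     for i in range(0, len(frames) - seg_len + 1, seg_len)])


def encode_video(frames, gmm, km, seg_len=10):
    """Two-level cascade: Fisher vectors per segment, then local aggregation."""
    return aggregate_locally(level1_codes(frames, gmm, seg_len), km)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_videos = 64, 8
    # Stand-ins for per-frame features from the spatial and temporal CNNs.
    spatial = [rng.standard_normal((120, dim)) for _ in range(n_videos)]
    temporal = [rng.standard_normal((120, dim)) for _ in range(n_videos)]
    labels = np.arange(n_videos) % 2              # toy labels for two classes

    stream_feats = []
    for stream in (spatial, temporal):
        # Codebooks would normally be learned on a training split only.
        gmm = GaussianMixture(n_components=8, covariance_type="diag",
                              random_state=0).fit(np.vstack(stream))
        km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(
            np.vstack([level1_codes(v, gmm) for v in stream]))
        stream_feats.append(np.stack([encode_video(v, gmm, km) for v in stream]))

    fused = np.hstack(stream_feats)               # fuse the two encoded streams
    clf = LinearSVC(C=1.0).fit(fused, labels)     # linear SVM on the fused feature
    print("training accuracy:", clf.score(fused, labels))
```

The helper names, the non-overlapping segmentation, and the use of scikit-learn are choices made here for brevity; on UCF101 the encoded features would be far higher-dimensional, but the order of operations matches the abstract: per-stream Fisher-vector encoding, local aggregation, fusion of the two streams, and SVM classification.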