《计算机应用研究》|Application Research of Computers

SLP vectorization method based on throttling (一种基于剪切的SLP向量化方法)

Authors 李颖颖, 奚慧兴, 高伟, 李伟, 翟胜伟
Affiliations 1. 信息工程大学, 郑州 450002; 2. 数学工程与先进计算国家重点实验室, 郑州 450002; 3. 鞍山师范学院, 辽宁 鞍山 114007; 4. 中国电子科技集团公司第二十七研究所, 郑州 450047
Article ID 1001-3695(2018)09-2578-05
DOI 10.3969/j.issn.1001-3695.2018.09.004
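Abstract As one of the key program-acceleration components in fields such as multimedia and scientific computing, SIMD extension units are now integrated into a wide range of processors. Automatic vectorization is currently an important means of generating SIMD-vectorized programs. The superword level parallelism (SLP) method is widely used in compilers and has become the main way to vectorize code at the basic-block level. When evaluating profitability, however, SLP considers only the benefit of vectorizing the code segment as a whole; it does not account for the fact that fragments with negative vectorization benefit reduce the overall benefit, so SLP cannot achieve the best vectorization result. To address this, the paper proposes a throttling-based SLP vectorization method (throttling SLP, TSLP). By searching for the optimal subgraph to vectorize, TSLP removes code fragments whose vectorization benefit is negative and thereby achieves a better result. Experiments on standard benchmark programs show that TSLP delivers an average performance improvement of 9% over the original SLP method.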
Keywords SIMD extension unit; automatic vectorization; superword level parallelism; cost model
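Funding National Natural Science Foundation of China (61472447)
National "863" Program of China (2014AA01A300)
National "核高基" (Core Electronic Devices, High-end Generic Chips and Basic Software) Major Project of China (2013ZX0102-8001-001-001)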
Article URL http://www.arocmag.com/article/01-2018-09-004.html
English title SLP vectorization method based on throttling
Authors (English) Li Yingying, Xi Huixing, Gao Wei, Li Wei, Zhai Shengwei
Affiliations (English) 1. Information Engineering University, Zhengzhou 450002, China; 2. State Key Laboratory of Mathematical Engineering & Advanced Computing, Zhengzhou 450002, China; 3. Anshan Normal University, Anshan, Liaoning 114007, China; 4. The 27th Research Institute, China Electronics Technology Group Corporation, Zhengzhou 450047, China
English abstract SIMD vectors are widely adopted in modern general-purpose processors because they can boost performance and energy efficiency for media and scientific applications. Compiler-based automatic vectorization is one approach to generating code that makes efficient use of the SIMD units. The SLP vectorization algorithm is the best-known implementation of automatic vectorization for basic-block-level code. In SLP, choosing whether to vectorize is a one-off decision for the whole graph that has been generated. This is sub-optimal, because the graph may contain code that is harmful to vectorization due to the need to move data from scalar registers into vectors, and the decision does not consider the potential benefit of throttling the graph by removing this harmful code. To overcome this limitation, this paper proposes throttling SLP (TSLP), a novel vectorization algorithm that finds the optimal subgraph to vectorize. Experiments show that TSLP decreases execution time by 9% on average compared with SLP.
English keywords SIMD extension; auto-vectorization; superword level parallelism (SLP); cost model
Received 2017/4/18
Revised 2017/6/12
Pages 2578-2582
CLC number TP301.6
Document code A