Application Research of Computers (《计算机应用研究》)

基于URL模式集的主题爬虫

Focused crawler based on URL patterns

Authors 胡萍瑞, 李石君
Affiliation 武汉大学计算机学院 (College of Computer, Wuhan University), Wuhan 430072
Article ID 1001-3695(2018)03-0694-06
DOI 10.3969/j.issn.1001-3695.2018.03.012
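Abstract To improve the performance of focused crawlers, this paper proposes a focused crawler based on a set of URL patterns, drawing on how sites organize their information and on the characteristics of URLs. The crawler works in two phases. In the experimental crawling phase, it collects sample data from a site, builds URL patterns with a pattern construction algorithm based on a URL prefix tree, forms a pattern relation graph, and analyzes that graph with the HITS algorithm to compute the importance of each pattern. In the focused crawling phase, it uses the generated URL patterns, without downloading pages in advance, to judge whether a page is topic-relevant and whether it can guide the crawler to crawl deeper, and it predicts the priority of links awaiting crawling from the importance of their URL patterns. Experiments show that, compared with existing focused crawlers, this crawler quickly guides crawling toward topic-relevant pages, maintains precision and recall, and effectively improves crawling efficiency.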
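The abstract does not spell out the pattern construction algorithm, so the following is only a minimal sketch of the general idea of deriving URL patterns from a prefix tree. Everything in it is an assumption made for illustration: the tokenization by host and path segment, the `max_children` threshold, the `*` wildcard, the sample `example.com` URLs, and all function names. It should not be read as the authors' algorithm.

```python
from urllib.parse import urlparse

WILDCARD = "*"  # hypothetical placeholder for a variable URL segment

# Hypothetical sample URLs standing in for the site sample collected in phase one.
SAMPLE_URLS = [
    "http://example.com/news/101.html",
    "http://example.com/news/102.html",
    "http://example.com/news/103.html",
    "http://example.com/list/sports",
]

def tokenize(url):
    """Split a URL into its host followed by the non-empty path segments."""
    parts = urlparse(url)
    return [parts.netloc] + [seg for seg in parts.path.split("/") if seg]

def build_trie(urls):
    """Insert each tokenized URL into a nested-dict prefix tree."""
    root = {}
    for url in urls:
        node = root
        for token in tokenize(url):
            node = node.setdefault(token, {})
    return root

def merge(children):
    """Union several subtrees into one, merging branches that share a key."""
    grouped = {}
    for child in children:
        for key, sub in child.items():
            grouped.setdefault(key, []).append(sub)
    return {key: merge(subs) for key, subs in grouped.items()}

def generalize(node, max_children=2):
    """Collapse any level with more than max_children branches into a wildcard."""
    if len(node) > max_children:
        node = {WILDCARD: merge(list(node.values()))}
    return {key: generalize(sub, max_children) for key, sub in node.items()}

def patterns(node, prefix=()):
    """List the root-to-leaf paths of the tree as slash-joined pattern strings."""
    if not node:
        return ["/".join(prefix)]
    found = []
    for key in sorted(node):
        found.extend(patterns(node[key], prefix + (key,)))
    return found

def matches(url, pattern):
    """Check a URL against a pattern segment by segment, without fetching it."""
    tokens = tokenize(url)
    parts = pattern.split("/")
    return len(tokens) == len(parts) and all(
        p == WILDCARD or p == t for p, t in zip(parts, tokens)
    )

URL_PATTERNS = patterns(generalize(build_trie(SAMPLE_URLS)))
```

With the sample above, the three article URLs collapse into the single pattern `example.com/news/*`, while `example.com/list/sports` stays literal. A new link can then be classified with `matches` from its URL alone, which mirrors the "no pre-downloading" property the abstract describes. One limitation of this sketch is that the nested-dict tree only yields patterns for leaf paths; a URL that is also a prefix of a longer URL is not reported separately.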
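Keywords focused crawler; URL pattern; URL prefix tree; pattern relation graph; URL pattern importance
Funding National Natural Science Foundation of China (61272109, 61502350)
Article URL http://www.arocmag.com/article/01-2018-03-012.html
English title Focused crawler based on URL patterns
Author names in English Hu Pingrui, Li Shijun
Affiliation in English College of Computer, Wuhan University, Wuhan 430072, China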
English abstract To improve the performance of focused crawlers, this paper proposes UPFC (a focused crawler based on URL patterns), which uses a two-phase framework and exploits the features of site information organization and of URLs. In the experimental crawler phase, it collects site samples and builds URL patterns with a pattern construction algorithm based on a URL prefix tree. It then applies the HITS algorithm to the pattern graph to calculate the importance of each pattern. In the focused crawler phase, the topic relevance and guiding significance of pages are determined from those URL patterns without pre-downloading, and the priority of links to be crawled is predicted according to the importance of their URL patterns. Experimental results show that the crawler can be guided to relevant pages quickly, maintains precision and recall, and improves crawling efficiency.
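To make the HITS step and the link-priority step more concrete, here is a small sketch of standard HITS power iteration run on a toy pattern graph, followed by a frontier ordered by pattern score. The pattern names (`home`, `list`, `detail`), the edges, the iteration count, and the choice to use the authority score alone as the priority are all assumptions for illustration; the abstract does not say how the paper combines hub and authority scores into an importance value.

```python
import heapq
from math import sqrt

# Hypothetical pattern graph: an edge (a, b) means pages matching
# pattern a were observed linking to pages matching pattern b.
PATTERN_EDGES = [("home", "list"), ("list", "detail"), ("home", "detail")]

def normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}

def hits(edges, iterations=50):
    """Run HITS power iteration and return (hub, authority) score dicts."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of patterns linking in.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = normalize(auth)
        # Hub update: sum the authority scores of patterns linked to.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        hub = normalize(hub)
    return hub, auth

def order_frontier(links, importance):
    """Return URLs sorted so that higher-scoring patterns are crawled first.

    links is a list of (url, pattern_name) pairs; patterns without a
    score default to 0.0. heapq is a min-heap, so scores are negated.
    """
    heap = [(-importance.get(pattern, 0.0), url) for url, pattern in links]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

HUB, AUTH = hits(PATTERN_EDGES)
```

In this toy graph `detail` receives links from both other patterns and ends up with the highest authority score, while `home` links out to both and ends up as the strongest hub. Calling `order_frontier([("u1", "home"), ("u2", "detail"), ("u3", "list")], AUTH)` therefore places `u2` first. A real crawler would likely weight hub scores as well, since hub-like list pages are what lead the crawler toward relevant detail pages.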
English keywords focused crawler; URL pattern; URL prefix tree; pattern graph; importance of URL pattern
Received 2016/10/31
Revised 2016/12/8
Pages 694-699,726
CLC number TP311.52
Document code A