《计算机应用研究》|Application Research of Computers

基于文本块密度和标签路径覆盖率的网页正文抽取

Webpage content extraction via text block density and tag path coverage

Authors 刘鹏程, 胡骏, 吴共庆
Affiliation School of Computer and Information, Hefei University of Technology (合肥工业大学 计算机与信息学院), Hefei 230009
Article ID 1001-3695(2018)06-1645-06
DOI 10.3969/j.issn.1001-3695.2018.06.010
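Abstract Besides the main content, most Webpages also contain noise such as navigation, advertisements, and disclaimers. To improve the accuracy of Webpage content extraction, this paper proposes an extraction method based on text block density and tag path coverage (CETD-TPC). Combining the strengths of the text block density feature and the tag path feature, it designs a new feature that fuses the two, uses the new feature to extract the best text block from a Webpage, and finally extracts the main content from that block. The method effectively handles noise block filtering and the difficulty of extracting short text in Webpage content, and it requires neither training nor manual processing. Experiments on the CleanEval dataset and on a dataset of news pages randomly selected from well-known websites show that CETD-TPC applies well across different data sources and outperforms the CETR, CETD, and CEPR algorithms.
Keywords content extraction; text block density; tag path coverage; feature fusion
Funding National Key Research and Development Program of China (2016YFB1000901)
National Natural Science Foundation of China (61273297, 61229301, 61673152)
Innovative Research Team Development Program of the Ministry of Education of China (IRT13059)
China Scholarship Council (201506695019)
Article URL http://www.arocmag.com/article/01-2018-06-010.html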
Authors (romanized) Liu Pengcheng, Hu Jun, Wu Gongqing
Affiliation (English) School of Computer & Information, Hefei University of Technology, Hefei 230009, China
English abstract Most Webpages contain content information as well as noisy information such as navigation, advertisements and disclaimer notices. To address this problem and improve the accuracy of Webpage content extraction, this paper proposes a Webpage content extraction method via text block density and tag path coverage (CETD-TPC). Combining the advantages of the text block density feature and the tag path feature, the paper designs a new feature, TDTPC, which fuses the two. It then extracts the best text block from a Webpage using the TDTPC feature and, finally, extracts the content from that block. Requiring neither manual processing nor training, CETD-TPC is an effective solution to the problems of noise block filtering and short text extraction. Experimental results on the CleanEval dataset and on Web news pages randomly selected from well-known websites show that CETD-TPC applies well to different data sets and performs better than CETR, CETD and CEPR.
English keywords content extraction; text block density; tag path coverage; feature fusion
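The abstract describes the pipeline only in outline: score each text block with a fused feature, pick the best block, then take its text. A minimal Python sketch of that outline follows. The abstract gives no formulas, so everything here is an assumption for illustration: density is taken as text characters per descendant tag, the tag path coverage feature is replaced by a simple stand-in (the block's share of all page text), and the two are fused by multiplication. The names `TreeBuilder`, `best_block` and `text_of` are hypothetical and are not the authors' implementation of CETD-TPC.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag
SKIP = {"script", "style"}                           # text here is not content


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # mix of Node and str, in document order
        self.chars = 0      # text characters in this subtree
        self.tags = 0       # descendant tags in this subtree


class TreeBuilder(HTMLParser):
    """Builds a rough DOM tree and accumulates per-subtree counts."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def _bump(self, field, amount):
        # Add `amount` to `field` on the current node and all its ancestors.
        node = self.cur
        while node is not None:
            setattr(node, field, getattr(node, field) + amount)
            node = node.parent

    def handle_starttag(self, tag, attrs):
        self._bump("tags", 1)
        if tag in VOID:
            return
        child = Node(tag, attrs, self.cur)
        self.cur.children.append(child)
        self.cur = child

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None:
            self.cur = node.parent

    def handle_data(self, data):
        text = data.strip()
        if not text or self.cur.tag in SKIP:
            return
        self.cur.children.append(text)
        self._bump("chars", len(text))


def iter_nodes(node):
    for child in node.children:
        if isinstance(child, Node):
            yield child
            yield from iter_nodes(child)


def text_of(node):
    parts = (c if isinstance(c, str) else text_of(c) for c in node.children)
    return " ".join(p for p in parts if p)


def best_block(html):
    """Return the highest-scoring node, or None if the page has no text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    total = builder.root.chars
    if total == 0:
        return None

    def score(node):
        density = node.chars / max(node.tags, 1)  # assumed density definition
        share = node.chars / total                # assumed coverage stand-in
        return density * share                    # assumed fusion rule

    return max(iter_nodes(builder.root), key=score)


PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a><a href="/about">About</a></div>
<div id="article">
<p>Most web pages carry navigation, advertisements and disclaimers next to the main text.</p>
<p>A block that holds long runs of text under few tags is a likely content block.</p>
</div>
<div id="footer"><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></div>
</body></html>
"""

best = best_block(PAGE)
print(best.attrs.get("id"))
print(text_of(best))
```

On this toy page the link-heavy navigation and footer blocks score low because their few characters are spread over several tags, while the article block combines long text with few tags. Real pages with unclosed tags or very short articles would need more care than this sketch takes.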
Received 2017/1/13
Revised 2017/2/24
Pages 1645-1650
CLC number TP391.1
Document code A