《计算机应用研究》|Application Research of Computers

Multi-document conceptual graph construction research based on open domain extraction

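Authors Sheng Yongpan (盛泳潘), Fu Xuefeng (付雪峰), Wu Tianxing (吴天星)
Affiliations 1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731; 2. School of Information Engineering, Nanchang Institute of Technology, Nanchang 330099; 3. School of Computer Science and Engineering, Southeast University, Nanjing 211189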
Article ID 1001-3695(2020)01-004-0019-07
DOI 10.19734/j.issn.1001-3695.2018.05.0454
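Abstract Against a background of information overload, mining and organizing the core concepts and their semantic connections from multiple documents that share a common topic has become an important challenge in information extraction. To address it, this paper proposes a novel multi-document conceptual graph construction method based on open-domain extraction. First, topic words are mined for predefined topics, and the documents are ranked with an improved TF-IDF algorithm. Then a series of steps, including coreference resolution, discourse weight computation, and triple instance extraction, extracts a large number of fact-expressing triple instances from the documents. To remove the noise inherent in open-domain methods and to improve extraction accuracy, a triple instance filtering algorithm is proposed. It extracts a set of salient relation instances that have high confidence and good semantic compatibility, and these instances form multiple conceptual subgraphs. Finally, equivalent concepts and relations in different subgraphs are merged to form a connected conceptual graph that expresses the topic well. Validation on the Signal Media news dataset shows that the proposed method can organize important topic information across documents, and that the resulting conceptual graph performs well on metrics such as topic concept coverage and relation instance compatibility. In practical applications, the conceptual graph is an important representation of multi-document content and a valuable reference both for users who want to explore how a given topic develops and for automatic document summarization.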
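The pipeline outlined in the abstract (rank documents by topic words, filter extracted triples, merge equivalent concepts into one graph) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the abstract does not specify the improved TF-IDF variant, the filtering criteria, or the merging rule, so the code substitutes standard smoothed TF-IDF, a plain confidence threshold, and merging by normalized label. All function names (`rank_documents`, `filter_triples`, `build_graph`, `is_connected`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def rank_documents(docs, topic_words):
    """Return document indices sorted by summed TF-IDF weight of the topic words.

    Assumption: standard smoothed TF-IDF, not the paper's improved variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency of each topic word.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in topic_words}
    scores = []
    for idx, toks in enumerate(tokenized):
        counts = Counter(toks)
        score = 0.0
        for w in topic_words:
            if counts[w] and df[w]:
                tf = counts[w] / len(toks)
                idf = math.log((1 + n) / (1 + df[w])) + 1.0  # smoothed IDF
                score += tf * idf
        scores.append((score, idx))
    # Highest score first; ties broken by original order.
    return [idx for _, idx in sorted(scores, key=lambda p: (-p[0], p[1]))]


def filter_triples(triples, min_conf):
    """Keep (subject, predicate, object) for triples with confidence >= min_conf.

    Assumption: a single threshold stands in for the paper's filtering algorithm.
    """
    return [(s, p, o) for s, p, o, conf in triples if conf >= min_conf]


def build_graph(triples):
    """Merge concepts whose normalized labels match; map each concept to its edges."""
    graph = defaultdict(set)
    for s, p, o in triples:
        s_key, o_key = s.strip().lower(), o.strip().lower()
        graph[s_key].add((p.strip().lower(), o_key))
        graph.setdefault(o_key, set())  # make sure the object is a node too
    return dict(graph)


def is_connected(graph):
    """Check whether the graph is connected when edge direction is ignored."""
    if not graph:
        return True
    undirected = defaultdict(set)
    for s, edges in graph.items():
        for _, o in edges:
            undirected[s].add(o)
            undirected[o].add(s)
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in undirected[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(graph)
```

In this sketch, triples extracted from different documents end up in the same graph node whenever their subject or object labels normalize to the same string, which is a much cruder notion of equivalence than the one the paper is likely to use.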
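Keywords open-domain extraction; multiple documents; conceptual graph construction
Funding National Natural Science Foundation of China (61762063)
Natural Science Foundation of Jiangxi Province (20171BAB202024)
Research Project of the Jiangxi Provincial Department of Education (GJJ170991)
State-Sponsored Graduate Student Program for Building High-Level Universities (201706070049)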
Article URL http://www.arocmag.com/article/01-2020-01-004.html
English title Multi-document conceptual graph construction research based on open domain extraction
Authors (English) Sheng Yongpan, Fu Xuefeng, Wu Tianxing
Affiliations (English) 1. School of Computer Science & Engineering, University of Electronic Science & Technology of China, Chengdu 611731, China; 2. School of Information Engineering, Nanchang Institute of Technology, Nanchang 330099, China; 3. School of Computer Science & Engineering, Southeast University, Nanjing 211189, China
English abstract Against a background of information overload, mining and organizing meaningful concepts and their semantic connections from a set of related documents on the same topic is a challenging information extraction task. This paper therefore proposes a novel multi-document conceptual graph construction method based on open-domain information extraction. First, documents are ranked according to the improved TF-IDF weights of topic words extracted under predefined topics. The method then relies on a series of steps, including coreference resolution, weight computation, and triple instance extraction, to extract numerous representative subject-predicate-object triples from multiple documents. To filter out the noise inherent in open-domain extraction and to improve extraction accuracy, the paper presents a triple filtering algorithm that retains only the most salient, confident, and compatible triples, which form multiple conceptual subgraphs. Finally, equivalent concepts and relations across different subgraphs are merged into a single connected conceptual graph. Experiments on the Signal Media dataset show that the proposed method can identify key information for a specific topic within and across documents, and that the resulting conceptual graph performs well in terms of topic concept coverage and triple compatibility. In practice, a conceptual graph is an important representation of multiple documents and is valuable both for exploring how a topic develops and for automatic document summarization.
English keywords open-domain extraction; multiple documents; conceptual graph construction
Received 2018/5/23
Revised 2018/8/6
Pages 19-25
CLC number TP391
Document code A