《计算机应用研究》 | Application Research of Computers

Research on sentence chunking scheme based on parameter estimation in text mining

Author: LIANG Feng-lan
Affiliation: Department of Computer Science, Suqian College, Suqian, Jiangsu 223800, China
Article ID: 1001-3695(2015)04-0986-06
DOI: 10.3969/j.issn.1001-3695.2015.04.006
Abstract: Deriving high-quality information from text generally involves working with large data sets and requires natural language processing methodologies together with statistical models for parameter estimation. To address this problem, this paper first optimizes a statistical parameter-estimation model for data that follow a power-law distribution. It then introduces a statistical method for the sentence chunking problem in text mining, which divides the sentences of a large corpus into smaller, meaningful phrases by estimating phrase probabilities iteratively. The method requires generating and storing a massive amount of phrase-frequency data and giving the compute nodes rapid access to it at each iteration. Experimental evaluation shows that the proposed multi-layered architecture significantly reduces the number of remote database queries, yielding an end-to-end application run time up to six times faster than a naive distributed implementation based on HBase alone.
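To make the chunking step concrete, below is a minimal Python sketch of iterative phrase-probability estimation, under assumptions of our own: a dynamic-programming segmenter, add-one smoothing, a fixed maximum phrase length, and an in-memory frequency dictionary that merely stands in for the distributed phrase-frequency store (HBase behind a cache layer) described in the abstract. It is an illustration of the general technique, not the paper's implementation, and all names in it are hypothetical.

import math
from collections import defaultdict

MAX_PHRASE_LEN = 4  # assumed upper bound on phrase length, in tokens


def segment(tokens, logprob):
    """Split a token list into phrases that maximize the summed phrase
    log-probabilities (a simple dynamic program over split points)."""
    n = len(tokens)
    best = [0.0] + [float("-inf")] * n   # best score of a segmentation ending at i
    back = [0] * (n + 1)                 # back-pointer to the previous split point
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_PHRASE_LEN), i):
            score = best[j] + logprob(" ".join(tokens[j:i]))
            if score > best[i]:
                best[i], back[i] = score, j
    chunks, i = [], n
    while i > 0:                         # recover the chunking from the back-pointers
        chunks.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return list(reversed(chunks))


def estimate_phrase_probs(corpus, iterations=5):
    """Re-estimate phrase frequencies from the current best segmentation of
    every sentence, then re-segment with the updated estimates."""
    counts, total = defaultdict(int), 0

    def logprob(phrase):
        # Add-one smoothing so unseen phrases remain usable in the first pass.
        return math.log((counts[phrase] + 1) / (total + 1))

    for _ in range(iterations):
        new_counts = defaultdict(int)
        for sentence in corpus:
            for chunk in segment(sentence.split(), logprob):
                new_counts[chunk] += 1
        counts, total = new_counts, sum(new_counts.values())
    return counts


if __name__ == "__main__":
    toy_corpus = [
        "new york is a big city",
        "i live in new york",
        "new york has a big port",
    ]
    for phrase, freq in sorted(estimate_phrase_probs(toy_corpus).items()):
        print(f"{freq:2d}  {phrase}")

In the setting the abstract describes, the frequency table queried by segment() would sit behind a multi-layered store (a local cache over a remote HBase table), so that most lookups at each iteration are served locally and only cache misses trigger remote queries.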
Keywords: data sets; parameter estimation; text mining; power-law; phrases; run time
Funding: General Program of the National Natural Science Foundation of China (61173051/F020104)
Article URL: http://www.arocmag.com/article/01-2015-04-006.html
Received: 2014-03-12
Revised: 2014-05-07
Pages: 986-991, 995
CLC number: TP391.1
Document code: A