《计算机应用研究》|Application Research of Computers

基于分级匹配的维吾尔语文档相似性计算及剽窃检测方法

Uyghur document similarity calculation and plagiarism detection based on hierarchical matching

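Authors 亚森·艾则孜, 艾山·吾买尔, 阿力木江·艾沙
Affiliations 1. Department of Information Security Engineering, Xinjiang Police College, Urumqi 830011; 2. Xinjiang University: a. School of Information Science and Engineering; b. Network Center, Urumqi 830046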
Article ID 1001-3695(2019)06-028-1731-06
DOI 10.19734/j.issn.1001-3695.2017.12.0853
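Abstract To address similarity calculation and plagiarism detection for documents written in Uyghur, this paper proposes a content-based Uyghur plagiarism detection (U-PD) method. First, a preprocessing stage segments the Uyghur text into words, removes stop words, extracts stems, and replaces synonyms; stem extraction is based on an n-gram statistical model. Next, the BKDRhash algorithm computes a hash value for each text block, and these values are assembled into the hash fingerprint of the whole document. Finally, using the hash fingerprints, the RKR-GST matching algorithm matches the document against the document library at the document, paragraph, and sentence levels to obtain the document similarity, which serves as the basis for plagiarism detection. Experimental evaluation on Uyghur documents shows that the proposed method accurately detects plagiarized documents and is feasible and effective.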
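Keywords Uyghur documents; similarity; plagiarism detection; document hash fingerprint; hierarchical matching
Funding National Natural Science Foundation of China (61762086, 61662077, 61363064)
National Social Science Fund of China (13CFX055)
Scientific Research Program of Higher Education Institutions of Xinjiang Uygur Autonomous Region (XJEDU2016I052, XJEDU2017M046)
Article URL http://www.arocmag.com/article/01-2019-06-028.html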
English title Uyghur document similarity calculation and plagiarism detection based on hierarchical matching
Authors (English) Yasen·Aizezi, Aishan·Wumaier, Alimujiang·Aisha
Affiliations (English) 1. Dept. of Information Security Engineering, Xinjiang Police College, Urumqi 830011, China; 2. a. School of Information Science & Engineering, b. Network Center, Xinjiang University, Urumqi 830046, China
English abstract To address similarity calculation and plagiarism detection for documents written in Uyghur, this paper proposes a content-based Uyghur plagiarism detection (U-PD) method. First, a preprocessing stage segments the Uyghur texts, deletes stop words, extracts stems, and replaces synonyms, where stem extraction is based on an n-gram statistical model. Then, the method calculates the hash value of each text block with the BKDRhash algorithm and constructs the hash fingerprint of the entire document. Finally, based on the hash fingerprints, it matches the document against the document library at the document level, the paragraph level, and the sentence level using the RKR-GST matching algorithm and obtains the document similarity, which is used to detect plagiarism. Experimental evaluation on Uyghur documents shows that the proposed method detects plagiarized documents accurately and is feasible and effective.
English keywords Uyghur documents; similarity; plagiarism detection; document hash fingerprinting; hierarchical matching
Received 2017/12/20
Revised 2018/3/9
Pages 1731-1736
CLC number TP391.1
Document code A
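
The fingerprinting and matching steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes tokens have already been preprocessed (segmented, stemmed, stop words removed), that a "text block" is a sliding window of `block_size` tokens, and that the BKDR hash runs over UTF-8 bytes with seed 131; the abstract specifies none of these details. For brevity the matcher is plain Greedy String Tiling without the Running Karp-Rabin speed-up used by RKR-GST, and the similarity formula (twice the tiled length over the combined length) is a common convention rather than one confirmed by this record. All function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def bkdr_hash(text: str, seed: int = 131) -> int:
    """BKDR string hash over UTF-8 bytes (assumed encoding), 32-bit wraparound."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & 0xFFFFFFFF  # emulate unsigned 32-bit overflow
    return h & 0x7FFFFFFF


def fingerprint(tokens: Sequence[str], block_size: int = 1) -> List[int]:
    """Hash every sliding block of `block_size` tokens (assumed block definition)."""
    if block_size < 1 or len(tokens) < block_size:
        return []
    return [
        bkdr_hash(" ".join(tokens[i:i + block_size]))
        for i in range(len(tokens) - block_size + 1)
    ]


def greedy_string_tiling(a: Sequence[int], b: Sequence[int],
                         min_match: int = 2) -> List[Tuple[int, int, int]]:
    """Plain Greedy String Tiling; returns tiles as (start_a, start_b, length)."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles: List[Tuple[int, int, int]] = []
    while True:
        max_match = min_match
        matches: List[Tuple[int, int, int]] = []
        for i in range(len(a)):
            if marked_a[i]:
                continue
            for j in range(len(b)):
                if marked_b[j]:
                    continue
                k = 0
                # Extend the match while elements agree and are still unmarked.
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        if not matches:
            break
        for i, j, k in matches:
            # Skip matches overlapped by a tile placed earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = True
                marked_b[j + offset] = True
            tiles.append((i, j, k))
    return tiles


def similarity(a: Sequence[int], b: Sequence[int], min_match: int = 2) -> float:
    """Similarity in [0, 1]: twice the tiled length over the combined length."""
    if not a and not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2.0 * covered / (len(a) + len(b))
```

Under these assumptions, the same `similarity` function could be applied to fingerprints built from whole documents, from paragraphs, or from sentences, which is one way to read the hierarchical matching described in the abstract.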