Knowledge Organization

Research on Literature Similarity Detection Based on Semantic Role Labeling

  • Wang Xiaodi,
  • Zhu Na,
  • Bai Rujiang,
  • Wang Xiaoyue
  • Institute of Scientific & Technical Information, Shandong University of Technology, Zibo 255049
Wang Xiaodi is a master's student at the Institute of Scientific & Technical Information, Shandong University of Technology; Bai Rujiang is a lecturer at the same institute; Wang Xiaoyue is a professor at the same institute.

Received: 2014-04-30

  Revised: 2014-06-03

  Published online: 2014-06-20

Funding

This paper is one of the research outcomes of the National Social Science Fund of China project "Research on the Detection of Idea Plagiarism in Academic Literature" (Grant No. 12CTQ032) and the Shandong University of Technology Humanities and Social Sciences Development Fund project "Web Information Retrieval and Intelligent Mining".

Cite this article

Wang Xiaodi, Zhu Na, Bai Rujiang, Wang Xiaoyue. Research on Literature Similarity Detection Based on Semantic Role Labeling[J]. 图书情报工作 (Library and Information Service), 2014, 58(12): 130-135. DOI: 10.13266/j.issn.0252-3116.2014.12.020

Abstract

In recent years, cases of academic misconduct have drawn the attention of both the academic community and the relevant authorities, making similarity detection an active research topic. To cope with semantic plagiarism, researchers have begun to study semantic information. This paper proposes a semantic similarity detection method for literature based on semantic role labeling (SRL). Each paper is first annotated with an SRL tool, and similarity is detected at sentence granularity, with the sentence as the smallest unit. The hypernyms of all words in a paper are extracted using a semantic dictionary, and each paper is represented as a sentence-word-semantic role-hypernym 4-partite graph. Semantically similar sentence pairs are identified by comparing sentences against the 4-partite graphs, and the Jaccard coefficient of the similar sentences in two papers is taken as their semantic similarity. Experiments show that the semantic similarity identified is 13% higher than that of the character-granularity Jaccard, word-granularity Jaccard, and Winnowing Jaccard methods. Constrained by the corpus and by current SRL tools, however, the results are not yet satisfactory, and the method still has considerable room for improvement.
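The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `HYPERNYMS` dictionary, the PropBank-style role labels (`A0`, `V`, `A1`), the `threshold` value, and the greedy one-to-one sentence matching are all assumptions made for illustration, and the sentences are assumed to be already labeled by some SRL tool. Each sentence is reduced to a set of (semantic role, hypernym) pairs, sentence pairs whose sets overlap enough are treated as similar, and the document score is a Jaccard coefficient over the matched sentences.

```python
# Hypothetical toy hypernym dictionary; the paper uses a semantic dictionary
# whose contents are not given on this page.
HYPERNYMS = {
    "dog": "animal", "cat": "animal",
    "car": "vehicle", "truck": "vehicle",
    "chased": "pursue", "followed": "pursue",
}


def features(sentence):
    """Map a labeled sentence [(word, role), ...] to a set of (role, hypernym) pairs.

    Words missing from the dictionary fall back to themselves.
    """
    return frozenset((role, HYPERNYMS.get(word, word)) for word, role in sentence)


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def document_similarity(doc_a, doc_b, threshold=0.5):
    """Jaccard coefficient over sentences, where matched sentences form the intersection.

    Assumption: each sentence in doc_a is greedily paired with the first unused
    sentence in doc_b whose feature overlap reaches `threshold`.
    """
    feats_a = [features(s) for s in doc_a]
    feats_b = [features(s) for s in doc_b]
    used_b = set()
    matches = 0
    for fa in feats_a:
        for j, fb in enumerate(feats_b):
            if j not in used_b and jaccard(fa, fb) >= threshold:
                used_b.add(j)
                matches += 1
                break
    total = len(feats_a) + len(feats_b) - matches
    return matches / total if total else 0.0


# Two tiny documents, each a list of role-labeled sentences.
doc_a = [
    [("dog", "A0"), ("chased", "V"), ("car", "A1")],
    [("car", "A0"), ("stopped", "V")],
]
doc_b = [
    [("cat", "A0"), ("followed", "V"), ("truck", "A1")],
    [("rain", "A0"), ("fell", "V")],
]

score = document_similarity(doc_a, doc_b)
print(f"{score:.3f}")
```

In this toy input the first sentences share no surface words, yet they map to the same (role, hypernym) set and count as one matched pair, so the score is 1 matched sentence out of 3 distinct sentences. A character- or word-granularity Jaccard comparison would score the same pair as dissimilar, which is the kind of paraphrase the abstract aims to catch.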

