Knowledge Organization

Research on a Hybrid Semantic Information Extraction Method for Scientific and Technical Literature

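  • Leng Fuhai,
  • Bai Rujiang,
  • Zhu Qingsong
  • 1. National Science Library, Chinese Academy of Sciences;
    2. Shandong University of Technology Library
Leng Fuhai is a research fellow at the National Science Library, Chinese Academy of Sciences; Bai Rujiang is affiliated with the National Science Library, Chinese Academy of Sciences, and is a lecturer at the Shandong University of Technology Library, E-mail: brj@sdut.edu.cn; Zhu Qingsong is a doctoral candidate at the National Science Library, Chinese Academy of Sciences.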

Received: 2013-04-15

Revised: 2013-05-10

Published online: 2013-06-05

A Hybrid Semantic Information Extraction Method for Scientific Research Papers

  • Leng Fuhai,
  • Bai Rujiang,
  • Zhu Qingsong
  • 1. Chinese Academy of Sciences, National Science Library, Beijing 100190;
    2. Shandong University of Technology Library, Zibo 255049

Received date: 2013-04-15

  Revised date: 2013-05-10

  Online published: 2013-06-05

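Abstract (translated from the Chinese)

Current knowledge extraction techniques cannot precisely extract the specific theories, methods, and performance indicator parameters mentioned in academic literature. To address this, the paper combines semantic annotation, rule-based extraction, and regular expressions into a hybrid semantic information extraction method for scientific and technical literature. The method first semantically annotates the literature to obtain the relevant academic terms. It then constructs extraction rules to extract the sentences that mention specific performance indicators. Finally, it applies regular expressions to those sentences to precisely extract the key performance indicators. A semantic information extraction experiment on scientific literature from the carbon nanotube research field shows that the method can quickly, effectively, and accurately extract the main innovative research content and performance indicators of scientific literature.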

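Cite this article

Leng Fuhai, Bai Rujiang, Zhu Qingsong. Research on a hybrid semantic information extraction method for scientific and technical literature[J]. Library and Information Service (图书情报工作), 2013, 57(11): 112-119. DOI: 10.7536/j.issn.0252-3116.2013.11.021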

Abstract

Existing knowledge extraction techniques cannot accurately extract the specific theoretical approaches and performance indicator parameters mentioned in academic literature. This paper proposes a hybrid semantic extraction method to address this problem. The method combines semantic tagging, rule-based extraction, and regular expressions to accurately extract the relevant information from scientific literature. First, semantic annotation is used to obtain the relevant academic terms. Then, specific extraction rules are constructed to extract the sentences associated with performance indicators. Finally, regular expressions are used to accurately extract the parameters of the key performance indicators. An experiment in the field of carbon nanotube research shows that the method can rapidly, efficiently, and accurately extract the innovative research content and performance indicators of scientific literature.
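
The three steps named in the abstract (term annotation, rule-based sentence selection, regular-expression extraction of indicator values) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the term list `DOMAIN_TERMS`, the indicator list `INDICATORS`, the unit pattern `VALUE_RE`, and the function names are all hypothetical, and a plain dictionary lookup stands in for the semantic annotation step.

```python
import re

# Hypothetical term list standing in for the semantic annotation step.
DOMAIN_TERMS = ["carbon nanotube", "chemical vapor deposition"]
# Hypothetical indicator names used by the sentence-selection rule.
INDICATORS = ["tensile strength", "electrical conductivity", "diameter"]
# A number followed by a unit; the unit list is illustrative only.
VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(GPa|MPa|S/cm|nm)\b")


def annotate(sentence, vocabulary):
    """Return the vocabulary entries found in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return [term for term in vocabulary if term in lowered]


def extract(text):
    """Run the three steps; return (indicator, value, unit, sentence) tuples."""
    results = []
    # Naive sentence split: ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Step 1: annotate the sentence with domain terms and indicator names.
        terms = annotate(sentence, DOMAIN_TERMS)
        indicators = annotate(sentence, INDICATORS)
        # Step 2: rule - keep only sentences with both a term and an indicator.
        if not terms or not indicators:
            continue
        # Step 3: regular expression pulls out each numeric value and its unit.
        for match in VALUE_RE.finditer(sentence):
            results.append(
                (indicators[0], float(match.group(1)), match.group(2), sentence)
            )
    return results


sample = (
    "Carbon nanotube fibers were grown by chemical vapor deposition. "
    "The carbon nanotube fiber reached a tensile strength of 1.2 GPa. "
    "The weather was fine."
)
for indicator, value, unit, _ in extract(sample):
    print(indicator, value, unit)
```

Pairing each value with the first indicator found in the sentence is a simplification; a sentence that reports several indicators would need a rule that binds each value to its nearest indicator name.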
