[Purpose/significance] Abstracts of scientific papers are an important indexing object in information organization, and indexing abstracts according to a defined structure benefits scientific communication, knowledge discovery, and intelligence analysis. How to index the millions of existing unstructured abstracts automatically, accurately, and quickly is a practical problem that urgently needs solving. [Method/process] This study assumes that different categories of abstracts are inherently consistent, so that research on structured abstracts can provide a methodological and technical reference for the automatic indexing of unstructured abstracts. On this basis, and drawing on the US National Library of Medicine's structural element label terminology and its label-to-category mappings, the study proposes the BOMRC system of structural elements together with a method for identifying structured abstracts and normalizing their indexing. Next, a research sample is selected and text mining is used to quantitatively analyze multiple indicators, such as word frequency and TF-IDF value, for single words, verbs, three-word lexical chunks, and four-word lexical chunks in the sample corpus, from which a semantic feature dictionary capable of identifying structural elements is built. Finally, a test set of unstructured abstracts is used to check the validity of the semantic feature dictionary. [Result/conclusion] The results show that the semantic feature dictionary method can effectively identify the various structural elements of unstructured abstracts and can be used to optimize automatic recognition models that are built around machine learning methods.
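The dictionary-building step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the five labels assume that BOMRC stands for the NLM categories Background, Objective, Methods, Results, and Conclusions; the toy sentences in `TRAIN` are invented; each category is treated as one document when computing TF-IDF; and the functions `ngrams`, `build_dictionary`, and `classify`, along with the additive scoring rule, are hypothetical. Verbs are not handled separately, since that would require part-of-speech tagging.

```python
import math
from collections import Counter

# Invented toy sentences standing in for labelled sections of structured abstracts.
# The five labels assume BOMRC maps to the NLM categories (an assumption).
TRAIN = {
    "BACKGROUND": ["little is known about sleep quality in nurses",
                   "previous studies have shown mixed findings"],
    "OBJECTIVE": ["the aim of this study was to assess sleep quality",
                  "we aimed to evaluate the effect of training"],
    "METHODS": ["we conducted a randomized controlled trial",
                "data were collected using a questionnaire"],
    "RESULTS": ["the results showed that sleep quality improved",
                "there was a significant difference between groups"],
    "CONCLUSIONS": ["these findings suggest that training is effective",
                    "further research is needed to confirm this"],
}


def ngrams(text, n_values=(1, 3, 4)):
    """Return single words, three-word chunks, and four-word chunks of a sentence."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in n_values
            for i in range(len(tokens) - n + 1)]


def build_dictionary(train, top_k=50):
    """Build {label: {feature: tf-idf weight}}, treating each label as one document."""
    tf = {label: Counter(f for s in sents for f in ngrams(s))
          for label, sents in train.items()}
    df = Counter()  # number of categories in which each feature occurs
    for counts in tf.values():
        df.update(counts.keys())
    n_docs = len(tf)
    dictionary = {}
    for label, counts in tf.items():
        total = sum(counts.values())
        scores = {f: (c / total) * math.log(n_docs / df[f])
                  for f, c in counts.items()}
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        # Keep the top_k features; drop those shared by every category (weight 0).
        dictionary[label] = {f: w for f, w in ranked[:top_k] if w > 0}
    return dictionary


def classify(sentence, dictionary):
    """Assign the label whose dictionary entries best match the sentence, or None."""
    feats = ngrams(sentence)
    scores = {label: sum(weights.get(f, 0.0) for f in feats)
              for label, weights in dictionary.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    feature_dict = build_dictionary(TRAIN)
    for sentence in ["we conducted a survey",
                     "the aim of this study was to compare groups"]:
        print(sentence, "->", classify(sentence, feature_dict))
```

A sentence from an unstructured abstract is labelled by summing the weights of its matching dictionary entries per category. In a machine learning pipeline, the same per-category scores could instead be supplied as input features to a classifier, which is one possible reading of the optimization mentioned in the conclusion.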