[Purpose/significance] The abstract of a scientific paper is a key object of indexing in information organization. Indexing abstracts according to consistent rules supports not only scientific communication and knowledge discovery but also intelligence analysis. How to index the millions of existing unstructured abstracts automatically, accurately, and quickly is therefore a crucial problem. [Method/process] This study assumes that different categories of abstracts are inherently consistent, so that the study of structured abstracts can provide a methodological and technical reference for the automatic indexing of unstructured abstracts. Based on this assumption and on the US National Library of Medicine's terminology for labeling structural elements, the study maps the various abstract element classifications onto one another and proposes the BOMRC system, a normalized indexing method for structured abstracts. We then collected a research sample and used text mining to analyze several features of structured abstracts quantitatively and statistically, including word frequency and TF-IDF values, along four dimensions: words, verbs, three-word lexical chunks, and four-word lexical chunks. This analysis enabled us to propose a semantic feature dictionary for structural elements. Finally, we used unstructured abstracts to test the validity of the semantic feature dictionary. [Result/conclusion] The results show that the semantic feature dictionary method can effectively identify the structural elements of scientific paper abstracts, and that it can be used to optimize automatic recognition models, which may be based on machine learning methods.
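The dictionary-building step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the sentences under each structural label are pooled into one "document" for TF-IDF, that three-word lexical chunks are the only feature, and that a sentence is assigned to the label whose dictionary entries it matches most strongly. The label names, the toy sentences, and the functions `ngrams`, `build_dictionary`, and `label_sentence` are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the lowercase word n-grams of a sentence as space-joined strings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(labeled_sentences, n=3, top_k=5):
    """Map each label to its top_k n-grams ranked by TF-IDF.

    Assumption: all sentences under one label form a single document,
    so an n-gram that occurs under every label receives weight zero.
    """
    counts = defaultdict(Counter)
    for label, sentence in labeled_sentences:
        counts[label].update(ngrams(sentence, n))

    n_labels = len(counts)
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())

    dictionary = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scores = {
            gram: (freq / total) * math.log(n_labels / doc_freq[gram])
            for gram, freq in counter.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        dictionary[label] = {g: s for g, s in ranked[:top_k] if s > 0}
    return dictionary


def label_sentence(sentence, dictionary, n=3):
    """Return the best-matching label, or None if no dictionary entry matches."""
    grams = set(ngrams(sentence, n))
    scores = {
        label: sum(w for g, w in feats.items() if g in grams)
        for label, feats in dictionary.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical toy sample standing in for a labeled structured-abstract corpus.
toy = [
    ("OBJECTIVE", "The aim of this study was to evaluate the new treatment."),
    ("OBJECTIVE", "The aim of this study was to compare two methods."),
    ("METHODS", "Patients were randomly assigned to two groups."),
    ("METHODS", "Participants were randomly assigned to receive placebo."),
    ("RESULTS", "There was no significant difference between the groups."),
    ("CONCLUSIONS", "These findings suggest that the treatment is effective."),
]
dictionary = build_dictionary(toy, n=3, top_k=5)
print(label_sentence("The aim of this paper was to test a model.", dictionary))
```

In a fuller version, the same routine would presumably be run separately for words, verbs, and four-word chunks and the resulting scores combined. How the study weights these dimensions is not stated in the abstract.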
Song Donghuan, Li Chenying, Liu Ziyu, Han Mingjie. Semantic Feature Dictionary Construction of Abstract in English Scientific Journals[J]. Library and Information Service, 2020, 64(6): 108-119.
DOI: 10.13266/j.issn.0252-3116.2020.06.013