[Purpose/Significance] The move structure of an academic paper helps readers understand its content in depth and locate key information quickly. This study investigates methods for full-text move recognition in order to capture the core content of academic papers rapidly and thereby advance intelligent semantic retrieval. [Method/Process] After reviewing existing research on move recognition methods, the study proposes SciBERT-HAMI, a fine-grained move recognition model that combines ChatGPT-based data augmentation with a pre-trained language model. The training corpus consists of the original texts plus texts augmented with the ChatGPT large language model, which increases both the variety and the volume of the training data. A hierarchical neural network learns the semantic feature representations of a paper at the word, sentence, and section levels, so that semantic information is captured at each level. SciBERT word embeddings serve as the input, and the hierarchical neural network is trained with the focal loss function (FocalLoss) for fine-grained move recognition. [Result/Conclusion] With the ChatGPT data augmentation strategy, the SciBERT-HAMI-DA model achieved F1 scores of 73.1% and 74.1% on the CoreSC and AZ datasets, respectively. Comparative experiments show that the proposed model improves performance on fine-grained move recognition in full-text academic papers, and ablation experiments verify its effectiveness. Integrating a pre-trained language model with ChatGPT data augmentation improves the predictions of the full-text move recognition model, which helps advance the automation and intelligence of academic research.
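The focal loss named in the abstract can be illustrated with a minimal single-sentence sketch in plain Python. It follows the standard multi-class form FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true move label. The abstract does not state the hyperparameters used, so gamma = 2.0 and alpha = 1.0 are assumptions, and the function names and example logits are hypothetical and not taken from the SciBERT-HAMI implementation.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for one sentence with true label index `target`."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sentence.

    gamma and alpha are assumed values; the abstract does not report them.
    The factor (1 - p_t) ** gamma shrinks the loss of well-classified
    sentences, so rare or hard move labels weigh more during training.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Hypothetical logits over three move labels; the true label is index 0.
    examples = {
        "easy": [4.0, 0.0, 0.0],  # true label already scored highest
        "hard": [0.0, 1.0, 0.0],  # a wrong label scored highest
    }
    for name, logits in examples.items():
        ce = cross_entropy(logits, 0)
        fl = focal_loss(logits, 0)
        print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f}")
```

With gamma set to 0 the function reduces to ordinary cross-entropy, which makes it easy to compare the two losses on the same data.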