REVIEW & COMMENTARY

Advances in Pre-trained Language Models From the Perspective of Information Science

  • Hu Haotian,
  • Deng Sanhong,
  • Wang Dongbo,
  • Shen Si,
  • Shen Jianwei
  • 1 School of Information Management, Nanjing University, Nanjing 210023;
    2 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023;
    3 College of Information Management, Nanjing Agricultural University, Nanjing 210095;
    4 School of Economics & Management, Nanjing University of Science and Technology, Nanjing 210094;
    5 Jiangsu Institute of Quality and Standardization, Nanjing 210029

Received date: 2023-06-09

Revised date: 2023-08-29

Online published: 2024-02-23

Supported by

This work is supported by the National Natural Science Foundation of China project titled "Research on Knowledge Graph Construction and Retrieval of Academic Full-text Based on Deep Learning" (Grant No. 71974094) and by the Fundamental Research Funds for the Central Universities of the Ministry of Education of China (Grant No. 0108/14370317).

Abstract

[Purpose/Significance] This paper systematically reviews and analyzes research on pre-trained language models (PLMs) in information science (IS) and intelligence work, providing a reference for the integration of pre-trained models into IS research. [Method/Process] Firstly, it briefly describes the basic principles and development of PLMs and summarizes the PLMs widely used in IS research. Secondly, it analyzes research hotspots at the macro level and summarizes related achievements in information organization, information retrieval, and information mining at the micro level, exploring in detail the application methods, improvement strategies, and performance of PLMs. Finally, it discusses the opportunities and challenges of PLMs in IS with respect to corpora, training, evaluation, and application. [Result/Conclusion] Currently, BERT and its variants are the most widely used and best-performing models in IS. The paradigm combining neural networks and fine-tuning is applied in various scenarios, especially domain information extraction and text classification, and its performance can be improved through continued pre-training, external knowledge enhancement, and architecture optimization. Key issues for future work include balancing the scale and quality of training corpora, improving the usability and security of models, evaluating the real capabilities of models accurately and across multiple dimensions, and accelerating the implementation of subject knowledge mining tools.
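
To make the "pre-train then fine-tune" paradigm referred to in the abstract concrete, the sketch below shows, under stated assumptions, how a pre-trained BERT encoder can be adapted to a downstream text classification task. It is a minimal illustration only: the Hugging Face transformers and datasets libraries, the bert-base-chinese checkpoint, the toy two-document corpus, and all hyperparameters are assumptions chosen for demonstration, not the setup of any study reviewed in this paper.

    # Minimal sketch: fine-tuning a pre-trained BERT encoder for text
    # classification (illustrative only; corpus, labels, and settings are hypothetical).
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import Dataset

    # Hypothetical labelled corpus: short texts with binary labels.
    train = Dataset.from_dict({
        "text": ["example document one", "example document two"],
        "label": [0, 1],
    })

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-chinese", num_labels=2)

    def encode(batch):
        # Tokenize and pad/truncate to a fixed input length.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    train = train.map(encode, batched=True)

    args = TrainingArguments(output_dir="clf-out", num_train_epochs=3,
                             per_device_train_batch_size=16)
    # Fine-tune all encoder weights plus the new classification head.
    Trainer(model=model, args=args, train_dataset=train).train()

In practice, the improvement strategies summarized above would modify this basic recipe, for example by continuing pre-training on a domain corpus before fine-tuning or by injecting external knowledge into the input representation.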

Cite this article

Hu Haotian, Deng Sanhong, Wang Dongbo, Shen Si, Shen Jianwei. Advances in Pre-trained Language Models From the Perspective of Information Science[J]. Library and Information Service, 2024, 68(3): 130-150. DOI: 10.13266/j.issn.0252-3116.2024.03.012
