Knowledge Organization

Research on Answer Ranking in Online Academic Communities Based on Multi-Label Classification Technology from the Perspective of Text Topics

  • Lin Litao
  • Wu Mengcheng
  • Liu Chang
  • Hu Die
  • Wang Dongbo
  • Huang Shuiqing

  • 1 School of Information Management, Nanjing University, Nanjing 210023;
    2 College of Information Management, Nanjing Agricultural University, Nanjing 210095;
    3 Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095;
    4 Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095

Lin Litao, PhD candidate; Wu Mengcheng, PhD candidate; Liu Chang, PhD candidate; Hu Die, master's student; Huang Shuiqing, professor and doctoral supervisor.

Received: 2023-08-04

Revised: 2023-11-22

Published online: 2024-03-27

Cite this article

Lin Litao, Wu Mengcheng, Liu Chang, Hu Die, Wang Dongbo, Huang Shuiqing. Research on Answer Ranking in Online Academic Communities Based on Multi-Label Classification Technology from the Perspective of Text Topics[J]. Library and Information Service, 2024, 68(5): 121-131. DOI: 10.13266/j.issn.0252-3116.2024.05.012

Abstract

[Purpose/Significance] User-generated answers in online academic communities vary widely in quality, which makes it hard for users to obtain efficient decision support. Identifying highly useful answers promotes the efficient use of question-and-answer knowledge in these communities. [Method/Process] From the perspective of text topic semantics, this paper proposes a question-answer relevance calculation method based on a deep pre-trained language model and multi-label classification, and uses it to rank user answers in online academic communities by usefulness. The method first extracts semantic vectors from the question text and the answer text, then maps them into a domain-specific topic vector space, so that the topic similarity between a question and an answer can be computed. [Result/Conclusion] The experimental data consist of all questions, and all answers to each question, in the "Help Completed" column of the paper submission board of the "Xiaomuchong" academic community. With NDCG and Q-Measure as evaluation metrics, the proposed method is compared with two conventional semantics-based ranking methods, Cross-Encoder and Bi-Encoder. The results show that the proposed method performs on par with the conventional methods while requiring less annotated data.
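The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes that each question and answer has already been mapped to a topic vector (for example, the per-topic probabilities output by a multi-label classifier), and it assumes cosine similarity as the topic similarity measure; the abstract does not say which measure the authors use. The function names (`cosine`, `rank_answers`, `dcg`, `ndcg`) and the toy vectors are hypothetical, and the NDCG shown is the common log2-discount variant, which may differ from the exact variant used in the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_answers(question_topics: Sequence[float],
                 answer_topics: Sequence[Sequence[float]]) -> List[int]:
    """Return answer indices sorted by descending topic similarity to the question."""
    scores = [cosine(question_topics, a) for a in answer_topics]
    return sorted(range(len(answer_topics)), key=lambda i: scores[i], reverse=True)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(position + 1) discount (positions start at 1)."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg(ranking: Sequence[int], relevance: Sequence[float]) -> float:
    """NDCG of a ranking (list of answer indices) against graded relevance labels."""
    ideal = dcg(sorted(relevance, reverse=True))
    if ideal == 0.0:
        return 0.0
    return dcg([relevance[i] for i in ranking]) / ideal


# Toy topic vectors (assumed classifier outputs over three hypothetical topics).
question = [0.9, 0.1, 0.8]
answers = [
    [0.1, 0.9, 0.0],  # mostly off-topic
    [0.8, 0.2, 0.9],  # closely matches the question's topics
    [0.5, 0.5, 0.5],  # partially matches
]
relevance = [0, 2, 1]  # hypothetical graded usefulness labels for the answers

order = rank_answers(question, answers)
print(order)                   # [1, 2, 0]
print(ndcg(order, relevance))  # 1.0
```

Because the ranking needs only topic vectors and a similarity function, no labeled question-answer relevance pairs are required at ranking time in this sketch; any labeled data would go into training the topic classifier itself.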
