图书情报工作 (Library and Information Service), 2017, Vol. 61, Issue 22: 36-44. DOI: 10.13266/j.issn.0252-3116.2017.22.005

• Theoretical Research •

中文问答社区答案质量的评价研究:以知乎为例

王伟1, 冀宇强2, 王洪伟2, 郑丽娟3   

  1. 华侨大学工商管理学院 泉州 362021;
  2. 同济大学经济与管理学院 上海 200092;
  3. 聊城大学商学院 聊城 252000
  • 收稿日期:2017-07-02 修回日期:2017-09-09 出版日期:2017-11-20 发布日期:2017-11-20
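  • Corresponding author: Wang Hongwei (ORCID: 0000-0003-0814-3498), professor, PhD, doctoral supervisor. E-mail: hwwang@tongji.edu.cn
  • About the authors: Wang Wei (ORCID: 0000-0001-5981-7312), lecturer, PhD; Ji Yuqiang (ORCID: 0000-0003-0478-4706), engineer, master's degree; Zheng Lijuan (ORCID: 0000-0002-8182-6765), lecturer, PhD.
  • Funding:
    This paper is one of the research outcomes of two National Natural Science Foundation of China projects: "The Influence of Textual Linguistic Features on the Financing Outcomes of Crowdfunding Projects: A Text Mining Approach" (Grant No. 71601082) and "Online and Offline Service Recovery Based on Text Mining of Online Reviews: The Case of Online Retailing" (Grant No. 71701085).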

Evaluating Chinese Answers' Quality in the Community QA System: A Case Study of Zhihu

Wang Wei1, Ji Yuqiang2, Wang Hongwei2, Zheng Lijuan3   

  1. College of Business Administration, Huaqiao University, Quanzhou 362021;
  2. School of Economics and Management, Tongji University, Shanghai 200092;
  3. School of Business, Liaocheng University, Liaocheng 252000
  • Received: 2017-07-02  Revised: 2017-09-09  Online: 2017-11-20  Published: 2017-11-20

摘要: [目的/意义]在线问答社区成为互联网用户获取高质量知识的重要途径,探索中文问答社区答案质量对知识传播具有重要意义。[方法/过程]以规模最大的中文问答社区之一"知乎"为研究对象,采用数据挖掘和机器学习方法,选取逻辑回归、支持向量机和随机森林三种分类模型,进行三层递进式训练和检验。从结构化特征、文本特征以及用户社交属性三个维度构建答案质量的特征体系。[结果/结论]实验结果显示,随着特征体系的不断丰富,三种分类模型的性能逐步提升;而随机森林作为一种组合分类模型,在全量特征的情况下,取得出色的分类性能。对特征组合分析发现,包含用户社交属性的随机森林总是比同等级的其它模型更加出色,表明社会化网络在答案质量评价中的地位。研究结论表明从答案本身和答案编写者两个角度能够评价答案质量,构建的特征体系和模型可以较为全面地预测答案质量。

关键词: 答案质量, 质量评价, 机器学习, 文本挖掘, 知乎

Abstract: [Purpose/significance] Online Q&A communities have become an important way for Internet users to obtain high-quality knowledge, so studying answer quality in Chinese Q&A communities matters for the dissemination of knowledge. [Method/process] We take Zhihu, one of the largest Chinese Q&A communities, as the research object and apply data mining and machine learning methods. Three classification models (logistic regression, support vector machine and random forest) are trained and tested in three progressive levels. The feature system for answer quality is built along three dimensions: structured features, text features and user social attributes. [Result/conclusion] The experimental results show that the performance of all three classification models improves step by step as the feature system is enriched. The random forest model, an ensemble classifier, performs better than the other models at the same feature level and achieves excellent classification performance with the full feature set. An analysis of feature combinations further shows that random forest models that include social features always outperform the models without them, which reflects the value of social attributes in evaluating answer quality. We conclude that answer quality can reasonably be evaluated from two perspectives, the answer itself and the social attributes of its author, and that the feature system and models we build can predict answer quality fairly comprehensively.

Key words: answer quality, quality evaluation, machine learning, text mining, Zhihu
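
The three-level progressive design described in the abstract can be sketched as an experiment grid. This is only an illustration under stated assumptions: the abstract names the three feature groups and the three models, but the individual feature names below are hypothetical, and the assumption that each level adds one group in the order structured, text, social is inferred from the abstract rather than confirmed by it. No classifier is trained here; the sketch only enumerates which feature columns each model would see.

```python
from itertools import combinations, product

# Hypothetical feature names; the abstract names only the three groups.
FEATURE_GROUPS = {
    "structured": ["answer_length", "image_count", "link_count"],
    "text": ["keyword_overlap", "sentiment_score"],
    "social": ["author_followers", "author_upvotes_received"],
}

# The three model families named in the abstract (labels only).
MODELS = ["logistic_regression", "svm", "random_forest"]


def progressive_levels(groups):
    """Return cumulative feature sets: level k uses the first k groups.

    Assumption: groups are added in dictionary order.
    """
    levels, columns = [], []
    for name, feats in groups.items():
        columns = columns + feats
        levels.append((name, list(columns)))
    return levels


def group_combinations(groups):
    """Return every non-empty combination of feature-group names."""
    names = list(groups)
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def experiment_grid(groups, models):
    """Pair every model with every progressive feature level."""
    levels = progressive_levels(groups)
    return [
        {"model": model, "level": index + 1, "features": cols}
        for (index, (_, cols)), model in product(enumerate(levels), models)
    ]


if __name__ == "__main__":
    for run in experiment_grid(FEATURE_GROUPS, MODELS):
        print(run["level"], run["model"], len(run["features"]))
```

Each entry of the grid would be handed to a training and evaluation routine of the reader's choice. `group_combinations` lists the feature-group subsets that a feature-combination analysis, such as comparing models with and without social features, would iterate over.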
