Evaluating Chinese Answers' Quality in the Community QA System: A Case Study of Zhihu

  • Wang Wei,
  • Ji Yuqiang,
  • Wang Hongwei,
  • Zheng Lijuan
  • 1. College of Business Administration, Huaqiao University, Quanzhou 362021;
    2. School of Economics and Management, Tongji University, Shanghai 200092;
    3. School of Business, Liaocheng University, Liaocheng 252000

Received date: 2017-07-02

Revised date: 2017-09-09

Online published: 2017-11-20

Abstract

[Purpose/significance] Online Q&A communities have become a major channel for accessing high-quality knowledge. It is therefore meaningful to study answer quality in Chinese Q&A communities, which promote the dissemination of knowledge. [Method/process] This paper focuses on Zhihu, the largest Chinese Q&A community. Using data mining and machine learning methods, namely logistic regression, support vector machine and random forest, we built three classification models that predict answer quality through three-level progressive training. We also constructed a feature set comprising structured features, text features and social features. [Result/conclusion] The experimental results show that the performance of all three classification models improves significantly as the feature system is progressively enriched. The random forest model consistently outperforms the other models at the same feature level. Moreover, across the feature combinations analyzed, random forest models that include social features consistently outperform those without them, which reflects the value of social attributes in evaluating answer quality. We conclude that it is reasonable to evaluate answer quality from both the answer itself and the writer's social attributes, and that the feature system we built reflects answer quality comprehensively.
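The three-level progressive training described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, each feature group (structured, text, social) is reduced to a single hypothetical numeric feature, and only logistic regression is shown, written with the standard library for brevity. The support vector machine and random forest models are omitted. All function names here are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, X, y):
    """Share of samples whose predicted class matches the label."""
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def make_data(n, rng):
    """Synthetic answers (assumption): one toy feature per group,
    with the quality label depending on all three groups."""
    rows, labels = [], []
    for _ in range(n):
        structured = rng.gauss(0, 1)
        text = rng.gauss(0, 1)
        social = rng.gauss(0, 1)
        rows.append([structured, text, social])
        labels.append(1 if structured + text + social > 0 else 0)
    return rows, labels


def progressive_eval(seed=0):
    """Train one model per feature level, adding one feature group each time."""
    rng = random.Random(seed)
    X_train, y_train = make_data(400, rng)
    X_test, y_test = make_data(200, rng)
    levels = {"structured": 1, "structured+text": 2, "structured+text+social": 3}
    results = {}
    for name, k in levels.items():
        w, b = train_logreg([x[:k] for x in X_train], y_train)
        results[name] = accuracy(w, b, [x[:k] for x in X_test], y_test)
    return results


if __name__ == "__main__":
    for name, acc in progressive_eval().items():
        print(f"{name}: {acc:.3f}")
```

Because the synthetic label depends on all three feature groups, accuracy on this toy data should rise as groups are added. That mirrors the pattern the abstract reports, but it says nothing about the actual Zhihu results.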

Cite this article

Wang Wei, Ji Yuqiang, Wang Hongwei, Zheng Lijuan. Evaluating Chinese Answers' Quality in the Community QA System: A Case Study of Zhihu[J]. Library and Information Service, 2017, 61(22): 36-44. DOI: 10.13266/j.issn.0252-3116.2017.22.005
