[Purpose/significance] Data science is rapidly taking shape as an emerging interdisciplinary field that draws on many disciplines. Extracting entity knowledge from data science job postings helps to track the development of data science from a market perspective and to improve the content of data science teaching. [Method/process] Drawing on job postings from major recruitment websites and on data acquisition, annotation, and organization methods from information science, this paper constructs a data science recruitment corpus and extracts entities from it for analysis. [Result/conclusion] On a corpus of 11,000 annotated job postings, the paper compares the performance of Bi-LSTM-CRF, CRF, and Bi-LSTM models on the task of extracting data science recruitment entities, selects the final automatic extraction model, designs an automatic extraction platform for data science recruitment entities, and builds a data science recruitment entity network.