Research on Domain Adaptation Technology of Chinese Science and Technology Literatures Segmentation

  • Shi Chongde,
  • Qiao Xiaodong,
  • Wang Huilin,
  • Qu Peng
  • Institute of Scientific and Technical Information of China, Beijing 100038

Received date: 2014-07-24

Revised date: 2014-09-01

  Online published: 2014-10-05

Abstract

Word segmentation is a basic step in processing science and technology (S&T) documents. Taking biomedical literature as an example, this paper studies domain adaptation for the segmentation of Chinese S&T literature. It adapts a sequence-labeling Chinese word segmentation method from the news domain to the S&T domain by means of dictionary features, domain character features, sub-word tagging, and a low-quality in-domain training corpus produced by dictionary-based segmentation, and it achieves a significant improvement. The results indicate that exploiting domain-specific features derived from domain knowledge plays an important role in improving the segmentation quality of S&T literature.
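As a rough illustration of three ingredients named in the abstract (dictionary-based segmentation to produce a low-quality in-domain corpus, character tagging for sequence labeling, and dictionary features), the following minimal sketch may help. It is not the authors' implementation: the forward maximum matching strategy, the B/M/E/S tag set, the `max_len` limit, the function names, and the toy lexicon are all assumptions made for this example.

```python
from typing import List, Set


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy dictionary-based segmentation (assumed strategy): take the
    longest lexicon word at each position, else a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def words_to_tags(words: List[str]) -> List[str]:
    """Map each character to a B/M/E/S position tag (assumed tag set)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def dictionary_features(text: str, lexicon: Set[str], max_len: int = 4) -> List[int]:
    """For each character, the length of the longest multi-character
    lexicon word starting at that character (0 if there is none)."""
    feats = []
    for i in range(len(text)):
        best = 0
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in lexicon:
                best = size
                break
        feats.append(best)
    return feats


# Hypothetical toy biomedical lexicon, for illustration only.
lexicon = {"胰岛素", "糖尿病", "治疗"}
sentence = "胰岛素治疗糖尿病"

words = forward_max_match(sentence, lexicon)      # noisy "silver" segmentation
tags = words_to_tags(words)                       # labels for a sequence model
feats = dictionary_features(sentence, lexicon)    # per-character lexicon feature
```

In a setup like this, the tags derived from dictionary segmentation would serve as noisy in-domain training labels, and the per-character feature values would be passed to the sequence labeler alongside its usual character features.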

Cite this article

Shi Chongde, Qiao Xiaodong, Wang Huilin, Qu Peng. Research on Domain Adaptation Technology of Chinese Science and Technology Literatures Segmentation[J]. Library and Information Service, 2014, 58(19): 13-18. DOI: 10.13266/j.issn.0252-3116.2014.19.002
