[Purpose/significance] A practical, large-scale geographic ontology database system is valuable for natural language processing (NLP), information retrieval, and intelligence analysis. This study aims to recognize and process complex place-name text automatically and with less manual intervention, including abbreviations, colloquial or non-standard names, and names that change over time. [Method/process] Previous studies have focused primarily on rigorous ontology building, using languages such as OWL to create standardized statements. This study instead simplifies the set of relationships among toponym elements and emphasizes obtaining and using massive name-and-address data from multiple types of sources. Building on software toolkits developed for address text segmentation, attribute and relationship analysis, and relationship reasoning, we construct a practical large-scale geographic ontology database system with the Neo4j graph database. [Result/conclusion] The system built with the methods and technologies introduced in this paper is fault tolerant and supports continuous data growth and updating. Its toponym analysis and its reasoning over relationships between toponym elements reach the expected precision, and it performs well in several NLP tasks, such as news topic tracking and address matching against blacklists for financial credit investigation.