Library and Information Service (图书情报工作) ›› 2013, Vol. 57 ›› Issue (10): 128-135. DOI: 10.7536/j.issn.0252-3116.2013.10.020

• Knowledge Organization •

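Web, Corpus and the Building of Bilingual Parallel Corpora

Xiong Wenxin (熊文新)

  1. National Research Center for Foreign Language Education, Beijing Foreign Studies University, Beijing 100089
  • Received: 2013-01-28 Revised: 2013-03-25 Online: 2013-05-20 Published: 2013-05-20
  • About the author: Xiong Wenxin, PhD, associate research fellow at the National Research Center for Foreign Language Education, Beijing Foreign Studies University. E-mail: xiongwenxin@bfsu.edu.cn.
  • Supported by:

    This paper is one of the research outcomes of the Ministry of Education Humanities and Social Sciences Research Project "A Study of Idiosyncratic English Combinations Based on Corpora and Corresponding Word Lists" (Grant No. 09YJA740013) and the National Social Science Fund of China project "Natural Language Serving Information Retrieval" (Grant No. 11BYY051).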

Abstract:

The notion of "Web as corpus" is understood in different ways. This paper examines the relations among the Web, corpora, and multilingual corpora. Drawing on the abundant electronic texts available on the World Wide Web and on the sublanguage strategy in language engineering, it proposes a "step-by-step, domain-by-domain" approach to building a large-scale bilingual parallel corpus: texts that are domain-specific, linguistically reliable, and consistently formatted are selected to build one domain-specific corpus at a time, and these corpora are eventually pooled into a high-quality, large-scale, all-domain bilingual parallel corpus. A worked example shows how to build a domain-specific bilingual parallel corpus from Web resources.

Key words: Web, corpus, sublanguage, bilingual parallel corpora, language resource

CLC Number: