The Web as a parallel corpus
P. Resnik and N. Smith.
Computational Linguistics 29 (3): 349--380 (September 2003)

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

URL: http://dx.doi.org/10.1162/089120103322711578

@article{resnik2003parallel,
  abstract = {Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.},
  acmid = {964753},
  added-at = {2012-09-26T17:03:46.000+0200},
  address = {Cambridge, MA, USA},
  author = {Resnik, Philip and Smith, Noah A.},
  biburl = {https://puma.uni-kassel.de/bibtex/22fdc5044a0d669f6766edaaceaae2bc3/jaeschke},
  doi = {10.1162/089120103322711578},
  interhash = {b23f5b4586fb7dd07c28c376b08c0eda},
  intrahash = {2fdc5044a0d669f6766edaaceaae2bc3},
  issn = {0891-2017},
  issue_date = {September 2003},
  journal = {Computational Linguistics},
  keywords = {corpus language linguistics web},
  month = sep,
  number = 3,
  numpages = {32},
  pages = {349--380},
  publisher = {MIT Press},
  timestamp = {2012-09-26T17:03:46.000+0200},
  title = {The Web as a parallel corpus},
  url = {http://dx.doi.org/10.1162/089120103322711578},
  volume = 29,
  year = 2003
}

%0 Journal Article
%1 resnik2003parallel
%A Resnik, Philip
%A Smith, Noah A.
%C Cambridge, MA, USA
%D 2003
%I MIT Press
%J Computational Linguistics
%K corpus language linguistics web
%N 3
%P 349--380
%R 10.1162/089120103322711578
%T The Web as a parallel corpus
%U http://dx.doi.org/10.1162/089120103322711578
%V 29
%X Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
