TY - JOUR
AU - Larsen, P O
AU - von Ins, M
T1 - The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index
JO - Scientometrics
PY - 2010/10
VL - 84
IS - 3
SP - 575
EP - 603
UR - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909426/
DO - 10.1007/s11192-010-0202-z
KW - citation
KW - corpus
KW - growth
KW - index
KW - publication
KW - scientific
N1 - The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index
AB - The growth rate of scientific publication has been studied from 1907 to 2007 using available data from a number of literature databases, including Science Citation Index (SCI) and Social Sciences Citation Index (SSCI). Traditional scientific publishing, that is publication in peer-reviewed journals, is still increasing although there are big differences between fields. There are no indications that the growth rate has decreased in the last 50 years. At the same time publication using new channels, for example conference proceedings, open archives and home pages, is growing fast. The growth rate for SCI up to 2007 is smaller than for comparable databases. This means that SCI was covering a decreasing part of the traditional scientific literature. There are also clear indications that the coverage by SCI is especially low in some of the scientific areas with the highest growth rate, including computer science and engineering sciences. The role of conference proceedings, open access archives and publications published on the net is increasing, especially in scientific fields with high growth rates, but this has only partially been reflected in the databases. The new publication channels challenge the use of the big databases in measurements of scientific productivity or output and of the growth rate of science. Because of the declining coverage and this challenge it is problematic that SCI has been used and is used as the dominant source for science indicators based on publication and citation numbers. The limited data available for social sciences show that the growth rate in SSCI was remarkably low and indicate that the coverage by SSCI was declining over time. National Science Indicators from Thomson Reuters is based solely on SCI, SSCI and Arts and Humanities Citation Index (AHCI). Therefore the declining coverage of the citation databases problematizes the use of this source.
ER -

TY - JOUR
AU - Denoyer, Ludovic
AU - Gallinari, Patrick
T1 - The Wikipedia XML Corpus
JO - SIGIR Forum
PY - 2006/
UR - http://www-connex.lip6.fr/~denoyer/wikipediaXML/
KW - data
KW - dm
KW - mining
KW - xml
KW - corpus
KW - ml
KW - wikipedia
N1 - Wikipedia XML Corpus
ER -

TY - CONF
AU - Liu, Vinci
AU - Curran, James R.
T1 - Web Text Corpus for Natural Language Processing.
T2 - EACL
PB - The Association for Computational Linguistics
PY - 2006/
UR - http://dblp.uni-trier.de/db/conf/eacl/eacl2006.html#LiuC06
KW - corpus
KW - dataset
KW - web
KW - synonym_detection
KW - nlp
SN - 1-932432-59-0
N1 - dblp
ER -

TY - JOUR
AU - Lewis, D. D.
AU - Yang, Y.
AU - Rose, T. G.
AU - Li, F.
T1 - RCV1: A New Benchmark Collection for Text Categorization Research
JO - Journal of Machine Learning Research
PY - 2004/
VL - 5
IS - Apr
SP - 361
EP - 397
UR - http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf
KW - benchmark
KW - text
KW - classification
KW - RCV1
KW - reuters
KW - corpus
ER -

TY - CONF
AU - Halevy, Alon Y.
AU - Madhavan, Jayant
A2 - Gottlob, Georg
A2 - Walsh, Toby
T1 - Corpus-Based Knowledge Representation
T2 - IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003
PB - Morgan Kaufmann
PY - 2003/
SP - 1567
EP - 1572
KW - representation
KW - knowledge
KW - based
KW - corpus
ER -

TY - JOUR
AU - Resnik, Philip
AU - Smith, Noah A.
T1 - The Web as a parallel corpus
JO - Computational Linguistics
PY - 2003/10
VL - 29
IS - 3
SP - 349
EP - 380
UR - http://dx.doi.org/10.1162/089120103322711578
DO - 10.1162/089120103322711578
KW - corpus
KW - language
KW - linguistics
KW - web
AB - Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
ER -

TY - JOUR
AU - Jiang, Jay J.
AU - Conrath, David W.
T1 - Semantic similarity based on corpus statistics and lexical taxonomy
JO - CoRR
PY - 1997/
VL - cmp-lg/9709008
KW - corpus
KW - semantic
KW - similarity
KW - wordnet
ER -

TY - CONF
AU - Hearst, Marti A.
T1 - Automatic acquisition of hyponyms from large text corpora
T2 - Proceedings of the 14th conference on Computational linguistics
PB - Association for Computational Linguistics
C1 - Stroudsburg, PA, USA
PY - 1992/
VL - 2
SP - 539
EP - 545
UR - http://dx.doi.org/10.3115/992133.992154
DO - 10.3115/992133.992154
KW - corpus
KW - hearst
KW - learning
KW - linguistics
KW - ontology
KW - pattern
KW - text
AB - We describe a method for the automatic acquisition of the hyponymy lexical relation from unrestricted text. Two goals motivate the approach: (i) avoidance of the need for pre-encoded knowledge and (ii) applicability across a wide range of text. We identify a set of lexico-syntactic patterns that are easily recognizable, that occur frequently and across text genre boundaries, and that indisputably indicate the lexical relation of interest. We describe a method for discovering these patterns and suggest that other lexical relations will also be acquirable in this way. A subset of the acquisition algorithm is implemented and the results are used to augment and critique the structure of a large hand-built thesaurus. Extensions and applications to areas such as information retrieval are suggested.
ER -

TY - BOOK
AU - Francis, Winthrop Nelson
AU - Kucera, Henry
T1 - Frequency Analysis of English Usage: Lexicon and Grammar
PB - Houghton Mifflin
PY - 1983/
KW - brown
KW - corpus
ER -