TY  - JOUR
AU  - Larsen, P O
AU  - von Ins, M
T1  - The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index
JO  - Scientometrics
PY  - 2010/10
VL  - 84
IS  - 3
SP  - 575
EP  - 603
UR  - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909426/
DO  - 10.1007/s11192-010-0202-z
KW  - citation
KW  - corpus
KW  - growth
KW  - index
KW  - publication
KW  - scientific
L1  - 
SN  - 
N1  - 
AB  - The growth rate of scientific publication has been studied from 1907 to 2007 using available data from a number of literature databases, including Science Citation Index (SCI) and Social Sciences Citation Index (SSCI). Traditional scientific publishing, that is publication in peer-reviewed journals, is still increasing although there are big differences between fields. There are no indications that the growth rate has decreased in the last 50 years. At the same time publication using new channels, for example conference proceedings, open archives and home pages, is growing fast. The growth rate for SCI up to 2007 is smaller than for comparable databases. This means that SCI was covering a decreasing part of the traditional scientific literature. There are also clear indications that the coverage by SCI is especially low in some of the scientific areas with the highest growth rate, including computer science and engineering sciences. The role of conference proceedings, open access archives and publications published on the net is increasing, especially in scientific fields with high growth rates, but this has only partially been reflected in the databases. The new publication channels challenge the use of the big databases in measurements of scientific productivity or output and of the growth rate of science. Because of the declining coverage and this challenge it is problematic that SCI has been used and is used as the dominant source for science indicators based on publication and citation numbers. The limited data available for social sciences show that the growth rate in SSCI was remarkably low and indicate that the coverage by SSCI was declining over time. National Science Indicators from Thomson Reuters is based solely on SCI, SSCI and Arts and Humanities Citation Index (AHCI). Therefore the declining coverage of the citation databases problematizes the use of this source.
ER  -

TY  - JOUR
AU  - Denoyer, Ludovic
AU  - Gallinari, Patrick
T1  - The Wikipedia XML Corpus
JO  - SIGIR Forum
PY  - 2006/
VL  - 
IS  - 
SP  - 
EP  - 
UR  - http://www-connex.lip6.fr/~denoyer/wikipediaXML/
DO  - 
KW  - data
KW  - dm
KW  - mining
KW  - xml
KW  - corpus
KW  - ml
KW  - wikipedia
L1  - 
SN  - 
N1  - Wikipedia XML Corpus
N1  - 
AB  - 
ER  -

TY  - CONF
AU  - Liu, Vinci
AU  - Curran, James R.
A2  - 
T1  - Web Text Corpus for Natural Language Processing.
T2  - EACL
PB  - The Association for Computational Linguistics
C1  - 
PY  - 2006/
CY  -  
VL  - 
IS  - 
SP  - 
EP  - 
UR  - http://dblp.uni-trier.de/db/conf/eacl/eacl2006.html#LiuC06
DO  - 
KW  - corpus
KW  - dataset
KW  - web
KW  - synonym_detection
KW  - nlp
L1  - 
SN  - 1-932432-59-0
N1  - dblp
N1  - 
AB  - 
ER  -

TY  - JOUR
AU  - Lewis, D. D.
AU  - Yang, Y.
AU  - Rose, T. G.
AU  - Li, F.
T1  - RCV1: A New Benchmark Collection for Text Categorization Research
JO  - Journal of Machine Learning Research
PY  - 2004/
VL  - 5
IS  - Apr
SP  - 361
EP  - 397
UR  - http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf
DO  - 
KW  - benchmark
KW  - text
KW  - classification
KW  - RCV1
KW  - reuters
KW  - corpus
L1  - 
SN  - 
N1  - 
N1  - 
AB  - 
ER  -

TY  - CONF
AU  - Halevy, Alon Y.
AU  - Madhavan, Jayant
A2  - Gottlob, Georg
A2  - Walsh, Toby
T1  - Corpus-Based Knowledge Representation
T2  - IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003
PB  - Morgan Kaufmann
C1  - 
PY  - 2003/
CY  -  
VL  - 
IS  - 
SP  - 1567
EP  - 1572
UR  - 
DO  - 
KW  - representation
KW  - knowledge
KW  - based
KW  - corpus
L1  - 
SN  - 
N1  - 
N1  - 
AB  - 
ER  -

TY  - JOUR
AU  - Resnik, Philip
AU  - Smith, Noah A.
T1  - The Web as a parallel corpus
JO  - Computational Linguistics
PY  - 2003/10
VL  - 29
IS  - 3
SP  - 349
EP  - 380
UR  - http://dx.doi.org/10.1162/089120103322711578
DO  - 10.1162/089120103322711578
KW  - corpus
KW  - language
KW  - linguistics
KW  - web
L1  - 
SN  - 
N1  - 
N1  - 
AB  - Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
ER  -

TY  - JOUR
AU  - Jiang, Jay J.
AU  - Conrath, David W.
T1  - Semantic similarity based on corpus statistics and lexical taxonomy
JO  - CoRR
PY  - 1997/
VL  - cmp-lg/9709008
IS  - 
SP  - 
EP  - 
UR  - 
DO  - 
KW  - corpus
KW  - semantic
KW  - similarity
KW  - wordnet
L1  - 
SN  - 
N1  - 
N1  - 
AB  - 
ER  -

TY  - CONF
AU  - Hearst, Marti A.
A2  - 
T1  - Automatic acquisition of hyponyms from large text corpora
T2  - Proceedings of the 14th conference on Computational linguistics
PB  - Association for Computational Linguistics
C1  - Stroudsburg, PA, USA
PY  - 1992/
CY  -  
VL  - 2
IS  - 
SP  - 539
EP  - 545
UR  - http://dx.doi.org/10.3115/992133.992154
DO  - 10.3115/992133.992154
KW  - corpus
KW  - hearst
KW  - learning
KW  - linguistics
KW  - ontology
KW  - pattern
KW  - text
L1  - 
SN  - 
N1  - 
N1  - 
AB  - We describe a method for the automatic acquisition of the hyponymy lexical relation from unrestricted text. Two goals motivate the approach: (i) avoidance of the need for pre-encoded knowledge and (ii) applicability across a wide range of text. We identify a set of lexico-syntactic patterns that are easily recognizable, that occur frequently and across text genre boundaries, and that indisputably indicate the lexical relation of interest. We describe a method for discovering these patterns and suggest that other lexical relations will also be acquirable in this way. A subset of the acquisition algorithm is implemented and the results are used to augment and critique the structure of a large hand-built thesaurus. Extensions and applications to areas such as information retrieval are suggested.
ER  -

TY  - BOOK
AU  - Francis, Winthrop Nelson
AU  - Kučera, Henry
A2  - 
T1  - Frequency Analysis of English Usage: Lexicon and Grammar
PB  - Houghton Mifflin
C1  - 
PY  - 1983/
VL  - 
IS  - 
SP  - 
EP  - 
UR  - 
DO  - 
KW  - brown
KW  - corpus
L1  - 
SN  - 
N1  - 
N1  - 
AB  - 
ER  -