20 Newsgroups
Abstract
This data set consists of 20000 messages taken from 20 Usenet newsgroups.
Information files:
description of the data
Data files:
20_newsgroups.tar.gz (17.3M; 61.6M uncompressed)
mini_newsgroups.tar.gz A subset composed of 100 articles from each newsgroup. (1.9M; 6.2M uncompressed)
This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.
Following a successful first edition, we are pleased to announce the 2nd edition of the Large Scale Hierarchical Text Classification (LSHTC) Pascal Challenge. The LSHTC Challenge is a hierarchical text classification competition, using large datasets. This year’s challenge will increase the scale and the difficulty of the task, using data from Wikipedia (www.wikipedia.org), in addition to the ODP Web directory data (www.dmoz.org).
Find and download data in any format, from financial to social networking to GIS data. Or sell data in our data marketplace, at a price you set. We have large data sets, spreadsheets, and databases packed with statistics.
A number of resources have been compiled within the context of the MuchMore project. These include: a bilingual, parallel medical corpus; corresponding queries and relevance assessments; evaluation sets of disambiguated terms for GermaNet and UMLS; an evaluation list for morphological decomposition of medical terms.
S-Match is an open source Java framework for semantic matching. It contains semantic matching, minimal semantic matching and structure preserving semantic matching algorithm implementations.
Scientext is a new, on-line French and English corpus of scientific texts. The corpus includes 4.8 million running tokens in French, 13 million words of research articles in English (medicine and biology), and an English-language sub-corpus of French undergraduate students’ texts (1,1 million words). The corpus is organized to facilitate the linguistic study of authorial position and reasoning in scientific articles through phraseology and lexico-grammatical markers linked to causality.