Tweets2011 As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere - i.e. both important and spam tweets are included.
Wmatrix is a software tool for corpus analysis and comparison. It provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains.
A number of resources have been compiled within the context of the MuchMore project. These include: a bilingual, parallel medical corpus; corresponding queries and relevance assessments; evaluation sets of disambiguated terms for GermaNet and UMLS; an evaluation list for morphological decomposition of medical terms.
A. Halevy, und J. Madhavan. IJCAI-03, Proceedings of the Eighteenth International Joint Conference
on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003, Seite 1567-1572. Morgan Kaufmann, (2003)