How big is the Internet? An unknown hacker has now answered this question with effective but illegal means: he gained access to hundreds of thousands of routers and used them as research probes. The result is a unique snapshot of today's Internet.
A collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks. The collection represents a collaboration between LabROSA and The Echo Nest. More details, background, and instructions on how to use the datasets can be found at LabROSA’s site. The goal of sharing this data on Infochimps is to provide a large dataset for research and to encourage the development of large-scale algorithms that operate on the data. There is one dataset for each letter of the alphabet (A-Z) containing data for all songs whose titles start with that letter, one dataset of additional files, and a small sample dataset. Each per-letter dataset consists of song files in the HDF5 format. Most of the data is licensed under the same terms as Echo Nest’s API. The code is under the GNU General Public License.
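Since each track in the per-letter datasets is stored as an HDF5 file, reading one with `h5py` is straightforward. The sketch below is illustrative only: the group and field names (`metadata/songs`, `artist_name`, `title`, `duration`) are assumptions modeled on the dataset's documented layout, and the snippet builds a tiny stand-in file in place so it runs without downloading the corpus; check LabROSA's provided wrapper code for the authoritative schema.

```python
import h5py
import numpy as np

# Assumption: each song file holds a one-row compound table per group,
# e.g. a "songs" table under the "metadata" group. Build a tiny stand-in
# file with that shape so the reading code below is runnable as-is.
dt = np.dtype([("artist_name", "S64"), ("title", "S64"), ("duration", "f8")])
with h5py.File("example_song.h5", "w") as f:
    grp = f.create_group("metadata")
    grp.create_dataset(
        "songs",
        data=np.array([(b"Example Artist", b"Example Title", 215.3)], dtype=dt),
    )

# Read the file back the way one would read a real per-song file:
# index row 0 of the compound dataset, then access fields by name.
with h5py.File("example_song.h5", "r") as f:
    row = f["metadata/songs"][0]
    artist = row["artist_name"].decode()
    title = row["title"].decode()
    duration = float(row["duration"])

print(f"{artist} - {title} ({duration:.1f}s)")
```

In practice one would loop over the thousands of `.h5` files in a letter's directory and extract the same fields from each.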
Kaggle is a platform for data prediction competitions. Companies, organizations and researchers post their data and have it scrutinized by the world's best statisticians.
Tweets2011 As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere - i.e. both important and spam tweets are included.
Pearson Longman English Language Teaching (Pearson Longman ELT) is a leading educational publisher of quality resources for all ages and abilities across the curriculum, providing solutions for teachers and students.
Find and download data in any format, from financial to social networking to GIS data. Or sell data in our data marketplace, at a price you set. We have large data sets, spreadsheets, and databases packed with statistics.
Following a successful first edition, we are pleased to announce the 2nd edition of the Large Scale Hierarchical Text Classification (LSHTC) Pascal Challenge. The LSHTC Challenge is a hierarchical text classification competition, using large datasets. This year’s challenge will increase the scale and the difficulty of the task, using data from Wikipedia (www.wikipedia.org), in addition to the ODP Web directory data (www.dmoz.org).
Scientext is a new online French and English corpus of scientific texts. The corpus includes 4.8 million running tokens in French, 13 million words of research articles in English (medicine and biology), and an English-language sub-corpus of French undergraduate students’ texts (1.1 million words). The corpus is organized to facilitate the linguistic study of authorial position and reasoning in scientific articles through phraseology and lexico-grammatical markers linked to causality.
Y. Song, L. Zhang, and C. Giles. CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 93--102. New York, NY, USA, ACM, (2008)