This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology.
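As a rough illustration of the kind of basic topology statistics mentioned above, the in-degree and out-degree of each page can be tallied from an edge list. This is a minimal sketch, assuming a plain-text edge list with one whitespace-separated `source target` pair of integer node IDs per line. That layout, the function name `degree_counts`, and the sample data are assumptions made for illustration, and the actual download format may differ.

```python
from collections import Counter
from typing import Iterable, Tuple


def degree_counts(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Tally out-degree and in-degree per node from 'source target' lines."""
    out_deg: Counter = Counter()
    in_deg: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        src, dst = int(parts[0]), int(parts[1])
        out_deg[src] += 1  # one more hyperlink leaving src
        in_deg[dst] += 1   # one more hyperlink pointing at dst
    return out_deg, in_deg


# Tiny hypothetical sample: four hyperlinks among three pages.
sample = ["0 1", "0 2", "1 2", "2 0"]
out_deg, in_deg = degree_counts(sample)
```

Because the function only iterates over lines, it can be fed an open file handle and stream the input. For a graph with billions of nodes, however, per-node counters in a Python dictionary would likely exceed the memory of a typical machine, so a fixed-size array or a compressed graph representation would probably be needed in practice.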
This software is a C++ translation of the excellent Webgraph library by P. Boldi and S. Vigna. The original library, written in Java, is easy to use but is hampered by some requirements of the Java virtual machine. This C++ translation attempts to preserve much of that ease of use (through integration with the Boost Graph Library) while bypassing the requirements imposed by a virtual machine.
B. Pereira Nunes, R. Kawase, S. Dietze, D. Taibi, M. Casanova, and W. Nejdl. Proceedings of the Web of Linked Entities Workshop in conjunction with the 11th International Semantic Web Conference, volume 906 of CEUR-WS.org, pages 45-57. (November 2012)
A. Schenker, H. Bunke, M. Last, and A. Kandel. Document Analysis Systems, volume 3163 of Lecture Notes in Computer Science, pages 401-412. Springer, (2004)