Introduction to Heritrix, an archival quality web crawler

Proceedings of the 4th International Web Archiving Workshop (IWAW'04), Bath, UK (July 2004)

Abstract

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. The Internet Archive started Heritrix development in early 2003. The intention was to develop a crawler for the specific purpose of archiving websites and to support multiple use cases, including focused and broad crawling. The software is open source to encourage collaboration and joint development across institutions with similar needs. A pluggable, extensible architecture facilitates customization and outside contribution. Now, after more than a year of development, the Internet Archive and other institutions are using Heritrix to perform focused and increasingly broad crawls.

Links and resources

BibTeX key: mohr2004introduction
