Discovering URLs through user feedback
X. Bai, B. Cambazoglu, and F. Junqueira.
Proceedings of the 20th ACM international conference on Information and knowledge management, pages 77--86. New York, NY, USA, ACM, (2011)

Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers, but not discovered by Yahoo! Web crawler. We show that a high fraction of URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler later than the time they are first accessed by users. One important conclusion of our work is that web search engines can highly benefit from user feedback in the form of toolbar logs for passive URL discovery.

URL: http://doi.acm.org/10.1145/2063576.2063592

@inproceedings{bai2011discovering,
  abstract = {Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers, but not discovered by Yahoo! Web crawler. We show that a high fraction of URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler later than the time they are first accessed by users. One important conclusion of our work is that web search engines can highly benefit from user feedback in the form of toolbar logs for passive URL discovery.},
  acmid = {2063592},
  added-at = {2012-09-06T10:23:10.000+0200},
  address = {New York, NY, USA},
  author = {Bai, Xiao and Cambazoglu, B. Barla and Junqueira, Flavio P.},
  biburl = {https://puma.uni-kassel.de/bibtex/24e73c9d6ed79931ccdfcfda938e3be62/jaeschke},
  booktitle = {Proceedings of the 20th ACM international conference on Information and knowledge management},
  doi = {10.1145/2063576.2063592},
  interhash = {dfef0e1af73b9c9e5096a2118368ad21},
  intrahash = {4e73c9d6ed79931ccdfcfda938e3be62},
  isbn = {978-1-4503-0717-8},
  keywords = {crawling crowd feedback search user web},
  location = {Glasgow, Scotland, UK},
  numpages = {10},
  pages = {77--86},
  publisher = {ACM},
  timestamp = {2012-09-26T11:51:52.000+0200},
  title = {Discovering URLs through user feedback},
  url = {http://doi.acm.org/10.1145/2063576.2063592},
  year = 2011
}

%0 Conference Paper
%1 bai2011discovering
%A Bai, Xiao
%A Cambazoglu, B. Barla
%A Junqueira, Flavio P.
%B Proceedings of the 20th ACM international conference on Information and knowledge management
%C New York, NY, USA
%D 2011
%I ACM
%K crawling crowd feedback search user web
%P 77--86
%R 10.1145/2063576.2063592
%T Discovering URLs through user feedback
%U http://doi.acm.org/10.1145/2063576.2063592
%X Search engines rely upon crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. As the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the pages altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers, but not discovered by Yahoo! Web crawler. We show that a high fraction of URLs that appear in toolbar logs are not discovered by the crawler. We also reveal that a certain fraction of URLs are discovered by the crawler later than the time they are first accessed by users. One important conclusion of our work is that web search engines can highly benefit from user feedback in the form of toolbar logs for passive URL discovery.
%@ 978-1-4503-0717-8
