M. Carman, M. Baillie, R. Gwadera, und F. Crestani. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, Seite 123--130. New York, NY, USA, ACM, (2009)
We investigate tag and query logs to see if the terms people use to annotate websites are similar to the ones they use to query for them. Over a set of URLs, we compare the distribution of tags used to annotate each URL with the distribution of query terms for clicks on the same URL. Understanding the relationship between the distributions is important to determine how useful tag data may be for improving search results and conversely, query data for improving tag prediction. In our study, we compare both term frequency distributions using vocabulary overlap and relative entropy. We also test statistically whether the term counts come from the same underlying distribution. Our results indicate that the vocabulary used for tagging and searching for content are similar but not identical. We further investigate the content of the websites to see which of the two distributions (tag or query) is most similar to the content of the annotated/searched URL. Finally, we analyze the similarity for different categories of URLs in our sample to see if the similarity between distributions is dependent on the topic of the website or the popularity of the URL.