Publications
Creating a searchable web archive
Gomes, D.; Cruz, D.; Miranda, J.; Costa, M. & Fontes, S.
2012, Technical report, Foundation for National Scientific Computing, Portugal [pdf]
The web became a mass means of publication that has been replacing printed media. However, its information is extremely ephemeral. Currently, most of the information available on the web is less than 1 year old. There are several initiatives worldwide that struggle to archive information from the web before it vanishes. However, search mechanisms to access this information are still limited and do not satisfy their users, who demand performance similar to live-web search engines. This paper presents some of the work developed to create an efficient and effective searchable web archive service, from data acquisition to user interface design. The results of this research were applied in practice to create the Portuguese Web Archive, which has been publicly available since January 2010. It supports full-text search over 1 billion contents archived from 1996 to 2010. The developed software is available as an open source project.
Text Mining Scientific Papers: a Survey on FCA-based Information Retrieval Research.
Poelmans, J.; Elzinga, P.; Viaene, S.; Dedene, G. & Kuznetsov, S. O.
Perner, P., ed., 'Industrial Conference on Data Mining - Poster and Industry Proceedings', IBaI Publishing, 82-96 (2011) [pdf]
Formal Concept Analysis (FCA) is an unsupervised clustering technique, and many scientific papers are devoted to applying FCA in Information Retrieval (IR) research. We collected 103 papers published between 2003 and 2009 which mention FCA and information retrieval in the abstract, title or keywords. Using a prototype of our FCA-based toolset CORDIET, we converted the PDF files containing the papers to plain text, indexed them with Lucene using a thesaurus containing terms related to FCA research, and then created the concept lattice shown in this paper. We visualized, analyzed and explored the literature with concept lattices and discovered multiple interesting research streams in IR, of which we give an extensive overview. The core contributions of this paper are the innovative application of FCA to the text mining of scientific papers and the survey of the FCA-based IR research.
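For readers unfamiliar with FCA, the machinery behind such concept lattices is compact: a formal context relates objects to attributes, and two derivation operators yield the formal concepts. A minimal sketch, with an invented toy context of papers and index terms (not CORDIET's actual data):

```python
from itertools import chain, combinations

# Toy formal context: papers (objects) x index terms (attributes). Invented.
context = {
    "paper1": {"fca", "retrieval"},
    "paper2": {"fca", "lattice"},
    "paper3": {"retrieval", "lattice"},
}

def intent(objects):
    """Attributes shared by all given objects (the prime operator on object sets)."""
    sets = [context[o] for o in objects]
    return set.intersection(*sets) if sets else {a for s in context.values() for a in s}

def extent(attributes):
    """Objects that have all given attributes (the prime operator on attribute sets)."""
    return {o for o, attrs in context.items() if attributes <= attrs}

# A formal concept is a pair (A, B) with extent(B) == A and intent(A) == B;
# brute-force enumeration of closures is fine for a toy context.
concepts = set()
for objs in chain.from_iterable(combinations(context, r) for r in range(len(context) + 1)):
    ext = extent(intent(set(objs)))
    concepts.add((frozenset(ext), frozenset(intent(ext))))
for ext, itt in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), "<->", sorted(itt))
```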
Search engines: information retrieval in practice
Croft, W. B.; Metzler, D. & Strohman, T.
2010, Addison-Wesley, Boston [pdf]
Folksonomies in Wissensrepräsentation und Information Retrieval
Peters, I.
2009, PhD thesis, Universität Düsseldorf
Citation context analysis for information retrieval
Ritchie, A.
2009, Technical report, University of Cambridge, Cambridge, UK [pdf]
This thesis investigates taking words from around citations to scientific papers in order to create an enhanced document representation for improved information retrieval. This method parallels how anchor text is commonly used in Web retrieval. In previous work, words from citing documents have been used as an alternative representation of the cited document but no previous experiment has combined them with a full-text document representation and measured effectiveness in a large scale evaluation. The contributions of this thesis are twofold: firstly, we present a novel document representation, along with experiments to measure its effect on retrieval effectiveness, and, secondly, we document the construction of a new, realistic test collection of scientific research papers, with references (in the bibliography) and their associated citations (in the running text of the paper) automatically annotated. Our experiments show that the citation-enhanced document representation increases retrieval effectiveness across a range of standard retrieval models and evaluation measures. In Chapter 2, we give the background to our work, discussing the various areas from which we draw together ideas: information retrieval, particularly link structure analysis and anchor text indexing, and bibliometrics, in particular citation analysis. We show that there is a close relatedness of ideas between these areas but that these ideas have not been fully explored experimentally. Chapter 3 discusses the test collection paradigm for evaluation of information retrieval systems and describes how and why we built our test collection. In Chapter 4, we introduce the ACL Anthology, the archive of computational linguistics papers that our test collection is centred around. The archive contains the most prominent publications since the beginning of the field in the early 1960s, consisting of one journal plus conferences and workshops, resulting in over 10,000 papers. Chapter 5 describes how the PDF papers are prepared for our experiments, including identification of references and citations in the papers, once converted to plain text, and extraction of citation information to an XML database. Chapter 6 presents our experiments: we show that adding citation terms to the full text of the papers improves retrieval effectiveness by up to 7.4%, that weighting citation terms higher relative to paper terms increases the improvement, and that varying the context from which citation terms are taken has a significant effect on retrieval effectiveness. Our main hypothesis that citation terms enhance a full-text representation of scientific papers is thus proven. There are some limitations to these experiments. The relevance judgements in our test collection are incomplete but we have experimentally verified that the test collection is, nevertheless, a useful evaluation tool. Using the Lemur toolkit constrained the method that we used to weight citation terms; we would like to experiment with a more realistic implementation of term weighting. Our experiments with different citation contexts did not identify an optimal citation context; we would like to extend the scope of our investigation. Now that our test collection exists, we can address these issues in our experiments and leave the door open for more extensive experimentation.
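The core representational idea is simple to sketch: terms from a window around each citation are appended to the cited paper's full text before indexing, the way anchor text is folded into a Web page's index entry. A minimal illustration (the documents, positions and window size are our invented assumptions, not the thesis setup):

```python
# Full text of the cited paper, keyed by id (invented example).
full_text = {"paperA": "grammar induction from raw text corpora"}

# Citing contexts: (citing paper's tokens, index of the citation marker, cited id).
citations = [
    ("we follow the grammar induction method of".split(), 6, "paperA"),
]

WINDOW = 3  # words taken on each side of the citation marker (illustrative)

def enhanced_representation(doc_id):
    """Full-text terms plus citation-context terms, ready for indexing."""
    terms = full_text[doc_id].split()
    for tokens, pos, cited in citations:
        if cited == doc_id:
            lo, hi = max(0, pos - WINDOW), pos + WINDOW + 1
            terms.extend(tokens[lo:hi])  # citation context acts like anchor text
    return terms

print(enhanced_representation("paperA"))
```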
Logsonomy - Social Information Retrieval with Logdata
Krause, B.; Jäschke, R.; Hotho, A. & Stumme, G.
'HT '08: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia', ACM, New York, NY, USA, [10.1145/1379092.1379123], 157-166 (2008) [pdf]
Social bookmarking systems constitute an established part of the Web 2.0. In such systems users describe bookmarks by keywords called tags. The structure behind these social systems, called folksonomies, can be viewed as a tripartite hypergraph of user, tag and resource nodes. This underlying network shows specific structural properties that explain its growth and the possibility of serendipitous exploration. Today’s search engines represent the gateway to retrieve information from the World Wide Web. Short queries typically consisting of two to three words describe a user’s information need. In response to the displayed results of the search engine, users click on the links of the result page as they expect the answer to be of relevance. This click data can be represented as a folksonomy in which queries are descriptions of clicked URLs. The resulting network structure, which we will term a logsonomy, is very similar to that of folksonomies. In order to find out about its properties, we analyze the topological characteristics of the tripartite hypergraph of queries, users and bookmarks on a large snapshot of del.icio.us and on query logs of two large search engines. All three datasets show small-world properties. The tagging behavior of users, which is explained by preferential attachment of the tags in social bookmark systems, is reflected in the distribution of single query words in search engines. We conclude that the clicking behaviour of search engine users, based on the displayed search results, and the tagging behaviour of social bookmarking users are driven by similar dynamics.
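The shared data model is easy to make concrete: both a folksonomy and a logsonomy are sets of (user, tag-or-query-word, resource) triples, i.e. hyperedges of a tripartite hypergraph. A toy sketch with invented triples, computing the kind of tag-degree distribution the paper analyzes:

```python
from collections import Counter

# Hyperedges of the tripartite hypergraph: (user, tag, resource). Invented.
triples = {
    ("u1", "python", "http://example.org/a"),
    ("u1", "search", "http://example.org/b"),
    ("u2", "python", "http://example.org/a"),
    ("u2", "web",    "http://example.org/c"),
}

# Degree of each tag node = number of hyperedges it occurs in; the shape of
# this distribution is what suggests preferential attachment.
tag_degree = Counter(tag for _, tag, _ in triples)
print(tag_degree.most_common())
```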
Introduction to Information Retrieval
Manning, C. D.; Raghavan, P. & Schütze, H.
2008, Cambridge University Press, New York [pdf]
"Class-tested and coherent, this textbook teaches classical and web information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures." -- Publisher's description.
Folksonomies in Wissensrepräsentation und Information Retrieval
Peters, I. & Stock, W. G.
Information - Wissenschaft und Praxis, 59(2) 77-90 (2008) [pdf]
Folksonomies in Knowledge Representation and Information Retrieval
In Web 2.0 services, “prosumers” – producers and consumers – collaborate not only for the purpose of creating content, but to index these pieces of information as well. Folksonomies permit actors to describe documents with subject headings, “tags”, without regarding any rules. Apart from a lot of benefits, folksonomies have many shortcomings (e.g., lack of precision). In order to solve some of the problems we propose interpreting tags as natural language terms. Accordingly, we can introduce methods of NLP to solve the tags’ linguistic problems. Additionally, we present criteria for tagged documents to create a ranking by relevance (tag distribution, collaboration and actor-based aspects). Besides recommending similar documents (“more like this!”), folksonomies can be used for the recommendation of similar users and communities (“more like me!”).
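How "tags as natural language terms" plays out in practice can be hinted at with a toy normalization step; the rules below are our simplification of the NLP treatment the article argues for (which includes stemming, decompounding and the like):

```python
import re

def normalize_tag(tag: str) -> str:
    """Collapse spelling variants of a tag into one index term (illustrative rules)."""
    tag = tag.lower()
    tag = re.sub(r"[_\-\s]+", "", tag)  # drop separators used inconsistently
    tag = re.sub(r"[^\w.]", "", tag)    # strip punctuation except dots
    return tag

# "Web2.0", "web_2.0" and "WEB-2.0" now land on the same term.
print({normalize_tag(t) for t in ["Web2.0", "web_2.0", "WEB-2.0"]})
```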
Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions
Cha, S.-H.
International Journal of Mathematical Models and Methods in Applied Sciences, 1(4) 300-307 (2007) [pdf]
Distance or similarity measures are essential to solve many pattern recognition problems such as classification, clustering, and retrieval problems. Various distance/similarity measures that are applicable to comparing two probability density functions (pdfs for short) are reviewed and categorized in both syntactic and semantic relationships. A correlation coefficient and a hierarchical clustering technique are adopted to reveal similarities among numerous distance/similarity measures.
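Three of the measure families the survey categorizes, written out for two discrete distributions; a minimal sketch with invented numbers, not Cha's full taxonomy:

```python
import math

p = [0.1, 0.4, 0.5]  # two discrete probability distributions (invented)
q = [0.2, 0.3, 0.5]

# Kullback-Leibler divergence (defined where p_i > 0 and q_i > 0).
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hellinger distance.
hellinger = math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                          for pi, qi in zip(p, q))) / math.sqrt(2)

# Cosine similarity of the two distributions viewed as vectors.
cosine = (sum(pi * qi for pi, qi in zip(p, q))
          / (math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(x * x for x in q))))

print(kl, hellinger, cosine)
```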
Information Retrieval in Folksonomies: Search and Ranking
Hotho, A.; Jäschke, R.; Schmitz, C. & Stumme, G.
Sure, Y. & Domingue, J., ed., 'The Semantic Web: Research and Applications', Lecture Notes in Computer Science, 4011, Springer, Heidelberg, 411-426 (2006)
Social bookmark tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. The reason for their immediate success is the fact that no specific skills are needed for participating. At the moment, however, the information retrieval support is limited. We present a formal model and a new search algorithm for folksonomies, called FolkRank, that exploits the structure of the folksonomy. The proposed algorithm is also applied to find communities within the folksonomy and is used to structure search results. All findings are demonstrated on a large scale dataset.
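The weight-spreading idea can be sketched as a personalized PageRank-style iteration on the undirected graph induced by the (user, tag, resource) triples; this is a much-simplified reading of FolkRank, with invented data and an arbitrary damping factor, not the authors' implementation:

```python
from collections import defaultdict

triples = [("u1", "python", "r1"), ("u2", "python", "r2"), ("u2", "web", "r2")]

# Every pair of nodes co-occurring in a triple is linked.
adj = defaultdict(set)
for u, t, r in triples:
    for a, b in [(u, t), (t, r), (u, r)]:
        adj[a].add(b)
        adj[b].add(a)

nodes = list(adj)
pref = {n: (1.0 if n == "python" else 0.0) for n in nodes}  # query preference
rank = {n: 1.0 / len(nodes) for n in nodes}
d = 0.7  # damping factor (illustrative)

for _ in range(50):
    rank = {n: (1 - d) * pref[n]
               + d * sum(rank[m] / len(adj[m]) for m in adj[n])
            for n in nodes}

print(sorted(rank.items(), key=lambda kv: -kv[1])[:3])  # nodes pulled toward "python"
```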
FooCA: web information retrieval with formal concept analysis
Koester, B.
2006, Beiträge zur begrifflichen Wissensverarbeitung, Verlag Allgemeine Wissenschaft, Mühltal [pdf]
This book deals with Formal Concept Analysis (FCA) and its application to Web Information Retrieval. It explains how Web search results retrieved by major Web search engines such as Google or Yahoo can be conceptualized, leading to a human-oriented form of representation. A generalization of Web search results is conducted, leading to an FCA-based introduction of FooCA. FooCA is an application in the field of Conceptual Knowledge Processing and supports the idea of a holistic representation of Web Information Retrieval.
Collaborative Tagging as a Knowledge Organisation and Resource Discovery Tool
Macgregor, G. & McCulloch, E.
Library Review, 55(5) 291-300 (2006) [pdf]
The purpose of the paper is to provide an overview of the collaborative tagging phenomenon and to explore some of the reasons for its emergence. Design/methodology/approach - The paper reviews the related literature and discusses some of the problems associated with, and the potential of, collaborative tagging approaches for knowledge organisation and general resource discovery. A definition of controlled vocabularies is proposed and used to assess the efficacy of collaborative tagging. An exposition of the collaborative tagging model is provided and a review of the major contributions to the tagging literature is presented. Findings - There are numerous difficulties with collaborative tagging systems (e.g. low precision, lack of collocation, etc.) that originate from the absence of properties that characterise controlled vocabularies. However, such systems cannot be dismissed. Librarians and information professionals have lessons to learn from the interactive and social aspects exemplified by collaborative tagging systems, as well as from their success in engaging users with information management. The future co-existence of controlled vocabularies and collaborative tagging is predicted, with each appropriate for use within distinct information contexts: formal and informal. Research limitations/implications - Librarians and information professional researchers should play a leading role in research aimed at assessing the efficacy of collaborative tagging in relation to information storage, organisation, and retrieval, and at influencing the future development of collaborative tagging systems. Practical implications - The paper indicates clear areas where digital libraries and repositories could innovate in order to better engage users with information. Originality/value - At the time of writing there were no literature reviews summarising the main contributions to collaborative tagging research or debate.
A taxonomy of web search
Broder, A.
SIGIR Forum, 36(2) 3-10 (2002) [pdf]
Classic IR (information retrieval) is inherently predicated on users searching for information, the so-called "information need". But the need behind a web search is often not informational -- it might be navigational (give me the url of the site I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop, download a file, or find a map). We explore this taxonomy of web searches and discuss how global search engines evolved to deal with web-specific needs.
Optimizing search engines using clickthrough data
Joachims, T.
'Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining', ACM, New York, NY, USA, [10.1145/775047.775067], 133-142 (2002) [pdf]
This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples.
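The pair-extraction step is the part that is easiest to make concrete: a clicked result is taken as preferred over higher-ranked results the user skipped, and those pairs become training constraints for the ranking SVM. A minimal sketch with an invented log entry:

```python
ranking = ["doc1", "doc2", "doc3", "doc4"]  # results as presented (invented)
clicked = {"doc3"}                          # links the user clicked

pairs = []  # (preferred, less_preferred) training constraints
for i, doc in enumerate(ranking):
    if doc in clicked:
        pairs.extend((doc, skipped) for skipped in ranking[:i]
                     if skipped not in clicked)

print(pairs)  # [('doc3', 'doc1'), ('doc3', 'doc2')]
```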
Cumulated gain-based evaluation of IR techniques
Järvelin, K. & Kekäläinen, J.
ACM Transactions on Information Systems, 20(4) 422-446 (2002) [pdf]
Modern large retrieval environments tend to overwhelm their users by their large output. Since not all documents are of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield. These novel measures are defined and discussed and their use is demonstrated in a case study using TREC data: sample system run results for 20 queries in TREC-7. As a relevance base we used novel graded relevance judgments on a four-point scale. The test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences. The graphs based on the measures also provide insight into the performance of IR techniques and allow interpretation, for example, from the user point of view.
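The three measures are short enough to state in code. A sketch in one common formulation (log base 2, no discount at rank 1); the graded judgments are invented values on a four-point scale:

```python
import math

gains = [3, 2, 3, 0, 1, 2]  # graded relevance of results, by rank (invented)

def cg(g):
    """Cumulated gain at each rank."""
    return [sum(g[:i + 1]) for i in range(len(g))]

def dcg(g):
    """Discounted cumulated gain: late ranks are devalued by a log discount."""
    return [sum(x / max(1.0, math.log2(i + 1)) for i, x in enumerate(g[:k + 1]))
            for k in range(len(g))]

ideal = sorted(gains, reverse=True)                       # best possible ordering
ncg = [d / di for d, di in zip(dcg(gains), dcg(ideal))]   # relative to the ideal

print(cg(gains)[-1], dcg(gains)[-1], ncg[-1])
```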
IR evaluation methods for retrieving highly relevant documents
Järvelin, K. & Kekäläinen, J.
'SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval', ACM, New York, NY, USA, [10.1145/345508.345545], 41-48 (2000) [pdf]
This paper proposes evaluation methods based on the use of non-dichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable from the user point of view in modern large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) two novel measures computing the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. We then demonstrate the use of these evaluation methods in a case study on the effectiveness of query types, based on combinations of query structures and expansion, in retrieving documents of various degrees of relevance. The test was run with a best-match retrieval system (InQuery) in a text database consisting of newspaper articles. The results indicate that the tested strong query structures are most effective in retrieving highly relevant documents. The differences between the query types are practically essential and statistically significant. More generally, the novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
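The separate-recall-bases idea reduces to evaluating a run against only those judged documents whose graded relevance meets a threshold. A minimal sketch with invented judgments on a 0-3 scale:

```python
judged = {"d1": 3, "d2": 1, "d3": 3, "d4": 0, "d5": 2}  # graded judgments (invented)
run = ["d1", "d2", "d5", "d3"]                          # ranked result list

def pr_at_k(k, min_grade):
    """Precision and recall at rank k against the recall base of grade >= min_grade."""
    base = {d for d, g in judged.items() if g >= min_grade}
    hits = sum(1 for d in run[:k] if d in base)
    return hits / k, hits / len(base)

print(pr_at_k(4, min_grade=3))  # evaluated against highly relevant documents only
```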
Modern Information Retrieval
Baeza-Yates, R. A. & Ribeiro-Neto, B.
1999, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA [pdf]
This is a rigorous and complete textbook for a first course on information retrieval from the computer science (as opposed to a user-centred) perspective. The advent of the Internet and the enormous increase in the volume of electronically stored information generally have led to substantial work on IR from the computer science perspective; this book provides an up-to-date, student-oriented treatment of the subject.
Indexing and retrieval of scientific literature
Lawrence, S.; Bollacker, K. & Giles, C. L.
'Proceedings of the Eighth International Conference on Information and Knowledge Management', ACM, New York, NY, USA, [10.1145/319950.319970], 139-146 (1999) [pdf]
The web has greatly improved access to scientific literature. However, scientific articles on the web are largely disorganized, with research articles being spread across archive sites, institution sites, journal sites, and researcher homepages. No index covers all of the available literature, and the major web search engines typically do not index the content of Postscript/PDF documents at all. This paper discusses the creation of digital libraries of scientific literature on the web, including the efficient location of articles, full-text indexing of the articles, autonomous citation indexing, information extraction, display of query-sensitive summaries and citation context, hubs and authorities computation, similar document detection, user profiling, distributed error correction, graph analysis, and detection of overlapping documents. The software for the system is available at no cost for non-commercial use.
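Of the listed components, the hubs-and-authorities computation is the most self-contained to illustrate: the classic HITS power iteration on a citation graph. A minimal sketch on an invented graph, not the system's actual code:

```python
import math

links = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c"]}  # who cites whom (invented)
nodes = list(links)
hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(20):
    # A good authority is cited by good hubs; a good hub cites good authorities.
    auth = {n: sum(hub[m] for m in nodes if n in links[m]) for n in nodes}
    hub = {n: sum(auth[m] for m in links[n]) for n in nodes}
    for v in (auth, hub):  # normalize both vectors to unit length
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        for n in v:
            v[n] /= norm

print({n: round(auth[n], 2) for n in nodes})  # "c" collects the top authority score
```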