Calculating Relevancy when Searching in Semantically and Hierarchically Structured Data as Exemplified with the eZ Publish Enterprise Content Management System's Internal Search Engine
T. Nunninger. Albert-Ludwigs-Universität Freiburg, (January 2007)
mantical meaning. Classical information retrieval concepts usually treat data as plain text, so they cannot take the structure and semantic meaning of XML data into account. XML retrieval faces two main challenges. First, it needs a powerful query language that can express both structural and content-related search conditions in a single query. I will describe XIRQL as introduced by FG04. It is the most powerful and generic query language for XML retrieval, as it is based on XPath and extends it with several information retrieval concepts. Second, it needs a way to find the most specific elements in the dataset. A relevant node (including its subtree) can contain sub-nodes that are relevant as well. In practice, therefore, the ranking-related statistics of a node in the search index are distributed across its subtree. The question is how to combine the ranking statistics within a subtree to compute the ranking value of the whole subtree. The only concepts I found calculate ranking values for the relevant sub-nodes and then combine those values with the ranking value of the subtree's root node. As this approach is not convincing, I take one step back: instead of calculating the ranking values of the sub-nodes and "summarizing" them, I summarize the ranking statistics of the sub-nodes and calculate the ranking value of the node from these summarized statistics. In this way, many proven information retrieval approaches can be adapted directly, and it should also be easy to apply advanced algorithms for fine-tuning the result. Finally, I implemented an evaluation environment using basic concepts of XIRQL and the vector space model. A case study with real user feedback provided promising results.
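The idea of summarizing ranking statistics before calculating a ranking value can be illustrated with a minimal sketch. It is not the implementation from the thesis: the `Node` class, the function names, the sample tree, the binary query vector, and the idf variant `log(1 + N/df)` are all assumptions chosen for illustration. Each node stores only its own term counts; the counts of a subtree are summed first, and a vector space score is then computed from the summed counts.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical index node: unique name, own term counts, child nodes."""
    name: str
    terms: Counter = field(default_factory=Counter)
    children: list = field(default_factory=list)


def subtree_stats(node):
    """Summarize the statistics: add up term counts of a node and its descendants."""
    total = Counter(node.terms)
    for child in node.children:
        total += subtree_stats(child)
    return total


def iter_nodes(node):
    """Yield the node itself and every node below it."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def rank(root, query_terms):
    """Score every subtree against the query, using the summed term counts."""
    nodes = list(iter_nodes(root))
    stats = {n.name: subtree_stats(n) for n in nodes}
    n_total = len(nodes)

    def idf(term):
        # Assumed idf variant; every subtree counts as one "document".
        df = sum(1 for s in stats.values() if s[term] > 0)
        return math.log(1 + n_total / df) if df else 0.0

    scores = {}
    for name, counts in stats.items():
        weights = {t: tf * idf(t) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Binary query vector; its constant norm is left out of the cosine.
        dot = sum(weights.get(t, 0.0) for t in query_terms)
        scores[name] = dot / norm if norm else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample tree: an article with a title and a body.
article = Node("article", Counter(), [
    Node("title", Counter({"xml": 1, "retrieval": 1})),
    Node("body", Counter({"ranking": 2, "index": 3})),
])
ranking = rank(article, ["xml", "retrieval"])
```

In this sample, the `title` node ranks above the whole `article`, because the length normalization penalizes the additional non-matching terms that the larger subtree inherits from `body`. This reflects the goal of preferring the most specific relevant element, although a real index would need to tune the weighting.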