Measuring the Similarity of Concept Hierarchies and its Influence on the Evaluation of Learning Procedures
K. Dellschaft. Institute for Computer Science, University of Koblenz-Landau, Germany, (December 2005)
The information available in corporate intranets and on the Internet grows from day to day. When looking for a specific piece of information, the question is often how to find it. Researchers therefore aim to provide more efficient access to large collections of information. Many of the algorithms developed for this purpose depend on additional domain knowledge to improve their results (see (Gonzalo et al., 1998) and (De Buenaga Rodríguez et al., 2000)). This domain knowledge is often available in the form of ontologies. An ontology reflects an understanding of a domain on which a community has agreed. It consists of several parts, such as a set of concepts and the relations between them. The concepts are organized in a hierarchy of sub- and superconcepts. For an ontology to actually improve the results of an application, it is crucial to model the domain in question accurately and exhaustively. Because this is a very complex and time-consuming task, one goal is to extract an ontology at least semi-automatically. Such learning procedures extract the necessary information from documents of the domain. Often these documents are natural language texts, such as websites or dictionaries, that contain domain knowledge (see (Kietz, Maedche and Volz, 2000) and (Cimiano, Hotho and Staab, 2004)). The quality of an automatically learned ontology is influenced mainly by two parameters: the learning procedure itself and the document corpus. Several alternative learning procedures exist. They are further differentiated by the types of documents they can process, i.e. whether they handle unstructured, semi-structured or structured documents. Websites are an example of unstructured documents, while dictionary entries and encyclopedia articles are examples of semi-structured documents. Finally, documents written in artificial languages, such as database schemas, are classified as structured documents.
It is often assumed that the availability of structural information leads to a better quality of the extracted ontology. To allow the different learning procedures to be compared, so that the best procedure can be chosen for a given purpose, they are often evaluated on an example corpus of documents. The quality of the extracted ontology is then measured as objectively as possible. Such an evaluation may also be used to fine-tune the parameters of a learning procedure so that better results are achieved. One way of objectively evaluating a learning procedure is to measure the similarity between the learned ontology and a previously defined reference ontology. This similarity then serves as a proxy for quality. It is assumed that the learning procedure will always produce results of comparable quality, and that this quality is influenced only by the document corpus, which must contain the correct information.
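The idea of comparing a learned concept hierarchy with a reference hierarchy can be illustrated with a minimal sketch. The code below is an assumption made for illustration only and is not the measure developed in this work: each hierarchy is represented as a hypothetical dictionary mapping a concept to its direct superconcept, concept-level precision and recall are computed on the two concept sets, and a simple hierarchy overlap averages the Jaccard similarity of the superconcept chains of the shared concepts. All function names and the toy data are invented for the example.

```python
def ancestors(hierarchy, concept):
    """Return the concept together with all of its superconcepts."""
    result = {concept}
    while hierarchy.get(concept) is not None:
        concept = hierarchy[concept]
        if concept in result:  # guard against accidental cycles
            break
        result.add(concept)
    return result


def concept_precision_recall(learned, reference):
    """Precision and recall of the learned concept set against the reference."""
    learned_concepts = set(learned)
    reference_concepts = set(reference)
    shared = learned_concepts & reference_concepts
    precision = len(shared) / len(learned_concepts) if learned_concepts else 0.0
    recall = len(shared) / len(reference_concepts) if reference_concepts else 0.0
    return precision, recall


def hierarchy_overlap(learned, reference):
    """Average Jaccard similarity of superconcept chains over shared concepts.

    Illustrative measure only; it is an assumption of this sketch.
    """
    shared = set(learned) & set(reference)
    if not shared:
        return 0.0
    total = 0.0
    for concept in shared:
        a = ancestors(learned, concept)
        b = ancestors(reference, concept)
        total += len(a & b) / len(a | b)
    return total / len(shared)


# Toy hierarchies (hypothetical data): concept -> direct superconcept.
reference = {"thing": None, "animal": "thing", "dog": "animal", "cat": "animal"}
learned = {"thing": None, "animal": "thing", "dog": "thing", "bird": "animal"}

precision, recall = concept_precision_recall(learned, reference)
overlap = hierarchy_overlap(learned, reference)
print(precision, recall, round(overlap, 3))
```

In this toy case the learned hierarchy contains three of the four reference concepts, but places "dog" directly under "thing", so the hierarchy overlap falls below 1 even for the concepts that were found. This shows why concept-set agreement alone does not capture the similarity of two hierarchies.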