ConDist: A Context-Driven Categorical Distance Measure.
In: ECMLPKDD2015
2015.
Markus Ring, Florian Otto, Martin Becker, Thomas Niebler, Dieter Landes and Andreas Hotho.
Linguistic Regularities in Sparse and Explicit Word Representations.
In: R. Morante and W.-t. Yih
(editors):
CoNLL, pages 171-180.
ACL, 2014.
Omer Levy and Yoav Goldberg.
Characterizing Semantic Relatedness of Search Query Terms.
In:
Proceedings of the 1st Workshop on Explorative Analytics of Information Networks (EIN2009).
Bled, Slovenia, 2009.
Dominik Benz, Beate Krause, G. Praveen Kumar, Andreas Hotho and Gerd Stumme.
Evaluating Similarity Measures for Emergent Semantics of Social Tagging.
In:
18th International World Wide Web Conference, pages 641-641.
2009.
Benjamin Markines, Ciro Cattuto, Filippo Menczer, Dominik Benz, Andreas Hotho and Gerd Stumme.
Social bookmarking systems and their emergent information structures, known as folksonomies, are increasingly important data sources for Semantic Web applications. A key question for harvesting semantics from these systems is how to extend and adapt traditional notions of similarity to folksonomies, and which measures are best suited for applications such as navigation support, semantic search, and ontology learning. Here we build an evaluation framework to compare various general folksonomy-based similarity measures derived from established information-theoretic, statistical, and practical measures. Our framework deals generally and symmetrically with users, tags, and resources. For evaluation purposes we focus on similarity among tags and resources, considering different ways to aggregate annotations across users. After comparing how tag similarity measures predict user-created tag relations, we provide an external grounding by user-validated semantic proxies based on WordNet and the Open Directory. We also investigate the issue of scalability. We find that mutual information with distributional micro-aggregation across users yields the highest accuracy, but is not scalable; per-user projection with collaborative aggregation provides the best scalable approach via incremental computations. The results are consistent across resource and tag similarity.
Semantic Analysis of Tag Similarity Measures in Collaborative Tagging Systems.
2008.
Ciro Cattuto, Dominik Benz, Andreas Hotho and Gerd Stumme.
Social bookmarking systems allow users to organise collections of resources on the Web in a collaborative fashion. The increasing popularity of these systems as well as first insights into their emergent semantics have made them relevant to disciplines like knowledge extraction and ontology learning. The problem of devising methods to measure the semantic relatedness between tags and characterizing it semantically is still largely open. Here we analyze three measures of tag relatedness: tag co-occurrence, cosine similarity of co-occurrence distributions, and FolkRank, an adaptation of the PageRank algorithm to folksonomies. Each measure is computed on tags from a large-scale dataset crawled from the social bookmarking system del.icio.us. To provide a semantic grounding of our findings, a connection to WordNet (a semantic lexicon for the English language) is established by mapping tags into synonym sets of WordNet, and applying there well-known metrics of semantic similarity. Our results clearly expose different characteristics of the selected measures of relatedness, making them applicable to different subtasks of knowledge extraction such as synonym detection or discovery of concept hierarchies.
Locally Expandable Allocation of Folksonomy Tags in a Directed Acyclic Graph.
In: J. Bailey, D. Maier, K.-D. Schewe, B. Thalheim and X. S. Wang
(editors):
WISE, volume 5175 of Lecture Notes in Computer Science, pages 151-162.
Springer, 2008.
Takeharu Eda, Masatoshi Yoshikawa and Masashi Yamamuro.
Extracting semantic relations from query logs.
In:
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 76-85.
ACM, New York, NY, USA, 2007.
Ricardo Baeza-Yates and Alessandro Tiberi.
Measuring semantic similarity between words using web search engines.
In:
WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 757-766.
ACM, New York, NY, USA, 2007.
Danushka Bollegala, Yutaka Matsuo and Mitsuru Ishizuka.
Time-dependent semantic similarity measure of queries using historical click-through data.
In:
WWW '06: Proceedings of the 15th international conference on World Wide Web, pages 543-552.
ACM Press, New York, NY, USA, 2006.
Qiankun Zhao, Steven C. H. Hoi, Tie-Yan Liu, Sourav S. Bhowmick, Michael R. Lyu and Wei-Ying Ma.
It has become a promising direction to measure similarity of Web search queries by mining the increasing amount of click-through data logged by Web search engines, which record the interactions between users and the search engines. Most existing approaches employ the click-through data for similarity measure of queries with little consideration of the temporal factor, while the click-through data is often dynamic and contains rich temporal information. In this paper we present a new framework of time-dependent query semantic similarity model on exploiting the temporal characteristics of historical click-through data. The intuition is that more accurate semantic similarity values between queries can be obtained by taking into account the timestamps of the log data. With a set of user-defined calendar schema and calendar patterns, our time-dependent query similarity model is constructed using the marginalized kernel technique, which can exploit both explicit similarity and implicit semantics from the click-through data effectively. Experimental results on a large set of click-through data acquired from a commercial search engine show that our time-dependent query similarity model is more accurate than the existing approaches. Moreover, we observe that our time-dependent query similarity model can, to some extent, reflect real-world semantics such as real-world events that are happening over time.
Detecting Similarities in Ontologies with the SOQA-SimPack Toolkit.
In: Y. Ioannidis, M. H. Scholl, J. W. Schmidt, F. Matthes, M. Hatzopoulos, K. Boehm, A. Kemper, T. Grust and C. Boehm
(editors):
10th International Conference on Extending Database Technology (EDBT 2006), volume 3896 of Lecture Notes in Computer Science, pages 59-76.
Springer, Munich, Germany, March 26-31, 2006.
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich and Abraham Bernstein.
From Distributional to Semantic Similarity.
PhD thesis, Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh, 2003.
James Richard Curran.
Lexical-semantic resources, including thesauri and WordNet, have been successfully incorporated into a wide range of applications in Natural Language Processing. However they are very difficult and expensive to create and maintain, and their usefulness has been severely hampered by their limited coverage, bias and inconsistency. Automated and semi-automated methods for developing such resources are therefore crucial for further resource development and improved application performance.
Systems that extract thesauri often identify similar words using the distributional hypothesis that similar words appear in similar contexts. This approach involves using corpora to examine the contexts each word appears in and then calculating the similarity between context distributions. Different definitions of context can be used, and I begin by examining how different types of extracted context influence similarity.
To be of most benefit these systems must be capable of finding synonyms for rare words. Reliable context counts for rare events can only be extracted from vast collections of text. In this dissertation I describe how to extract contexts from a corpus of over 2 billion words. I describe techniques for processing text on this scale and examine the trade-off between context accuracy, information content and quantity of text analysed.
Distributional similarity is at best an approximation to semantic similarity. I develop improved approximations motivated by the intuition that some events in the context distribution are more indicative of meaning than others. For instance, the object-of-verb context wear is far more indicative of a clothing noun than get. However, existing distributional techniques do not effectively utilise this information. The new context-weighted similarity metric I propose in this dissertation significantly outperforms every distributional similarity metric described in the literature.
Nearest-neighbour similarity algorithms scale poorly with vocabulary and context vector size. To overcome this problem I introduce a new context-weighted approximation algorithm with bounded complexity in context vector size that significantly reduces the system runtime with only a minor performance penalty. I also describe a parallelized version of the system that runs on a Beowulf cluster for the 2 billion word experiments.
To evaluate the context-weighted similarity measure I compare ranked similarity lists against gold-standard resources using precision and recall-based measures from Information Retrieval, since the alternative, application-based evaluation, can often be influenced by distributional as well as semantic similarity. I also perform a detailed analysis of the final results using WordNet.
Finally, I apply my similarity metric to the task of assigning words to WordNet semantic categories. I demonstrate that this new approach outperforms existing methods and overcomes some of their weaknesses.
Building Hypertext Links By Computing Semantic Similarity.
IEEE Transactions on Knowledge and Data Engineering, 11:713-730, 1999.
S.J. Green.
Measures of distributional similarity.
In:
Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 25-32.
Association for Computational Linguistics, Morristown, NJ, USA, 1999.
Lillian Lee.
Syntactic clustering of the Web.
Computer Networks and ISDN Systems, 29(8-13):1157-1166, 1997.
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse and Geoffrey Zweig.
We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.
Semantic similarity based on corpus statistics and lexical taxonomy.
CoRR, cmp-lg/9709008, 1997.
Jay J. Jiang and David W. Conrath.
The limitations of term co-occurrence data for query expansion in document retrieval systems.
Journal of the American Society for Information Science, 42(5):378-383, 1991.
Helen J. Peat and Peter Willett.