P |
Ring, M.; Otto, F.; Becker, M.; Niebler, T.; Landes, D. & Hotho, A.
(2015):
ConDist: A Context-Driven Categorical Distance Measure.
[BibTeX][Endnote]
@inproceedings{ring2015condist,
  author    = {Ring, Markus and Otto, Florian and Becker, Martin and Niebler, Thomas and Landes, Dieter and Hotho, Andreas},
  title     = {{ConDist}: A Context-Driven Categorical Distance Measure},
  booktitle = {ECML PKDD 2015},
  year      = {2015},
  keywords  = {2015, categorical, data, learning, measure, myown, similarity, unsupervised},
}
%0 = inproceedings
%A = Ring, Markus and Otto, Florian and Becker, Martin and Niebler, Thomas and Landes, Dieter and Hotho, Andreas
%D = 2015
%T = ConDist: A Context-Driven Categorical Distance Measure
|
P |
Levy, O. & Goldberg, Y.
(2014):
Linguistic Regularities in Sparse and Explicit Word Representations.
In: CoNLL,
[Volltext]
[BibTeX][Endnote]
@inproceedings{conf/conll/LevyG14,
  author    = {Levy, Omer and Goldberg, Yoav},
  title     = {Linguistic Regularities in Sparse and Explicit Word Representations},
  editor    = {Morante, Roser and Yih, Wen-tau},
  booktitle = {CoNLL},
  publisher = {ACL},
  year      = {2014},
  pages     = {171--180},
  url       = {http://dblp.uni-trier.de/db/conf/conll/conll2014.html#LevyG14},
  isbn      = {978-1-941643-02-0},
  keywords  = {kallimachos, posts, representation, similarity, toread, word},
}
%0 = inproceedings
%A = Levy, Omer and Goldberg, Yoav
%B = CoNLL
%D = 2014
%I = ACL
%T = Linguistic Regularities in Sparse and Explicit Word Representations.
%U = http://dblp.uni-trier.de/db/conf/conll/conll2014.html#LevyG14
|
P |
Benz, D.; Krause, B.; Kumar, G. P.; Hotho, A. & Stumme, G.
(2009):
Characterizing Semantic Relatedness of Search Query Terms.
In: Proceedings of the 1st Workshop on Explorative Analytics of Information Networks (EIN2009),
Bled, Slovenia.
[BibTeX][Endnote]
@inproceedings{benz2009characterizing,
  author    = {Benz, Dominik and Krause, Beate and Kumar, G. Praveen and Hotho, Andreas and Stumme, Gerd},
  title     = {Characterizing Semantic Relatedness of Search Query Terms},
  booktitle = {Proceedings of the 1st Workshop on Explorative Analytics of Information Networks (EIN2009)},
  address   = {Bled, Slovenia},
  year      = {2009},
  keywords  = {2009, ecml, measures, myown, pkdd, similarity, workshop},
}
%0 = inproceedings
%A = Benz, Dominik and Krause, Beate and Kumar, G. Praveen and Hotho, Andreas and Stumme, Gerd
%B = Proceedings of the 1st Workshop on Explorative Analytics of Information Networks (EIN2009)
%C = Bled, Slovenia
%D = 2009
%T = Characterizing Semantic Relatedness of Search Query Terms
|
P |
Markines, B.; Cattuto, C.; Menczer, F.; Benz, D.; Hotho, A. & Stumme, G.
(2009):
Evaluating Similarity Measures for Emergent Semantics of Social Tagging.
In: 18th International World Wide Web Conference,
[Volltext]
[Kurzfassung] [BibTeX][Endnote]
Social bookmarking systems and their emergent information structures, known as folksonomies, are increasingly important data sources for Semantic Web applications. A key question for harvesting semantics from these systems is how to extend and adapt traditional notions of similarity to folksonomies, and which measures are best suited for applications such as navigation support, semantic search, and ontology learning. Here we build an evaluation framework to compare various general folksonomy-based similarity measures derived from established information-theoretic, statistical, and practical measures. Our framework deals generally and symmetrically with users, tags, and resources. For evaluation purposes we focus on similarity among tags and resources, considering different ways to aggregate annotations across users. After comparing how tag similarity measures predict user-created tag relations, we provide an external grounding by user-validated semantic proxies based on WordNet and the Open Directory. We also investigate the issue of scalability. We find that mutual information with distributional micro-aggregation across users yields the highest accuracy, but is not scalable; per-user projection with collaborative aggregation provides the best scalable approach via incremental computations. The results are consistent across resource and tag similarity.
@inproceedings{www200965,
  author    = {Markines, Benjamin and Cattuto, Ciro and Menczer, Filippo and Benz, Dominik and Hotho, Andreas and Stumme, Gerd},
  title     = {Evaluating Similarity Measures for Emergent Semantics of Social Tagging},
  booktitle = {18th International World Wide Web Conference},
  year      = {2009},
  pages     = {641--641},
  url       = {http://www2009.eprints.org/65/},
  keywords  = {2009, measure, myown, semantics, similarity, social, tagging, taggingsurvey, tagorapub},
  abstract  = {Social bookmarking systems and their emergent information structures, known as folksonomies, are increasingly important data sources for Semantic Web applications. A key question for harvesting semantics from these systems is how to extend and adapt traditional notions of similarity to folksonomies, and which measures are best suited for applications such as navigation support, semantic search, and ontology learning. Here we build an evaluation framework to compare various general folksonomy-based similarity measures derived from established information-theoretic, statistical, and practical measures. Our framework deals generally and symmetrically with users, tags, and resources. For evaluation purposes we focus on similarity among tags and resources, considering different ways to aggregate annotations across users. After comparing how tag similarity measures predict user-created tag relations, we provide an external grounding by user-validated semantic proxies based on WordNet and the Open Directory. We also investigate the issue of scalability. We find that mutual information with distributional micro-aggregation across users yields the highest accuracy, but is not scalable; per-user projection with collaborative aggregation provides the best scalable approach via incremental computations. The results are consistent across resource and tag similarity.},
}
%0 = inproceedings
%A = Markines, Benjamin and Cattuto, Ciro and Menczer, Filippo and Benz, Dominik and Hotho, Andreas and Stumme, Gerd
%B = 18th International World Wide Web Conference
%D = 2009
%T = Evaluating Similarity Measures for Emergent Semantics of Social Tagging
%U = http://www2009.eprints.org/65/
|
Cattuto, C.; Benz, D.; Hotho, A. & Stumme, G.
(2008):
Semantic Analysis of Tag Similarity Measures in Collaborative Tagging Systems.
[Volltext] [Kurzfassung] [BibTeX] [Endnote] Social bookmarking systems allow users to organise collections of resources on the Web in a collaborative fashion. The increasing popularity of these systems as well as first insights into their emergent semantics have made them relevant to disciplines like knowledge extraction and ontology learning. The problem of devising methods to measure the semantic relatedness between tags and characterizing it semantically is still largely open. Here we analyze three measures of tag relatedness: tag co-occurrence, cosine similarity of co-occurrence distributions, and FolkRank, an adaptation of the PageRank algorithm to folksonomies. Each measure is computed on tags from a large-scale dataset crawled from the social bookmarking system del.icio.us. To provide a semantic grounding of our findings, a connection to WordNet (a semantic lexicon for the English language) is established by mapping tags into synonym sets of WordNet, and applying there well-known metrics of semantic similarity. Our results clearly expose different characteristics of the selected measures of relatedness, making them applicable to different subtasks of knowledge extraction such as synonym detection or discovery of concept hierarchies.
@misc{cattuto-2008,
  author        = {Cattuto, Ciro and Benz, Dominik and Hotho, Andreas and Stumme, Gerd},
  title         = {Semantic Analysis of Tag Similarity Measures in Collaborative Tagging Systems},
  year          = {2008},
  eprint        = {0805.2045},
  archiveprefix = {arXiv},
  url           = {http://www.citebase.org/abstract?id=oai:arXiv.org:0805.2045},
  keywords      = {2008, analysis, learning, myown, ol, ontology, semantic, similarity, tag},
  abstract      = {Social bookmarking systems allow users to organise collections of resources on the Web in a collaborative fashion. The increasing popularity of these systems as well as first insights into their emergent semantics have made them relevant to disciplines like knowledge extraction and ontology learning. The problem of devising methods to measure the semantic relatedness between tags and characterizing it semantically is still largely open. Here we analyze three measures of tag relatedness: tag co-occurrence, cosine similarity of co-occurrence distributions, and FolkRank, an adaptation of the PageRank algorithm to folksonomies. Each measure is computed on tags from a large-scale dataset crawled from the social bookmarking system del.icio.us. To provide a semantic grounding of our findings, a connection to WordNet (a semantic lexicon for the English language) is established by mapping tags into synonym sets of WordNet, and applying there well-known metrics of semantic similarity. Our results clearly expose different characteristics of the selected measures of relatedness, making them applicable to different subtasks of knowledge extraction such as synonym detection or discovery of concept hierarchies.},
}
%0 = misc
%A = Cattuto, Ciro and Benz, Dominik and Hotho, Andreas and Stumme, Gerd
%D = 2008
%T = Semantic Analysis of Tag Similarity Measures in Collaborative Tagging Systems
%U = http://www.citebase.org/abstract?id=oai:arXiv.org:0805.2045
|
|
P |
Eda, T.; Yoshikawa, M. & Yamamuro, M.
(2008):
Locally Expandable Allocation of Folksonomy Tags in a Directed Acyclic Graph..
In: WISE,
[Volltext]
[BibTeX][Endnote]
@inproceedings{conf/wise/EdaYY08,
  author    = {Eda, Takeharu and Yoshikawa, Masatoshi and Yamamuro, Masashi},
  title     = {Locally Expandable Allocation of Folksonomy Tags in a Directed Acyclic Graph},
  editor    = {Bailey, James and Maier, David and Schewe, Klaus-Dieter and Thalheim, Bernhard and Wang, Xiaoyang Sean},
  booktitle = {WISE},
  series    = {Lecture Notes in Computer Science},
  volume    = {5175},
  publisher = {Springer},
  year      = {2008},
  pages     = {151--162},
  url       = {http://dblp.uni-trier.de/db/conf/wise/wise2008.html#EdaYY08},
  isbn      = {978-3-540-85480-7},
  keywords  = {ol, similarity, toread},
}
%0 = inproceedings
%A = Eda, Takeharu and Yoshikawa, Masatoshi and Yamamuro, Masashi
%B = WISE
%D = 2008
%I = Springer
%T = Locally Expandable Allocation of Folksonomy Tags in a Directed Acyclic Graph.
%U = http://dblp.uni-trier.de/db/conf/wise/wise2008.html#EdaYY08
|
P |
Baeza-Yates, R. & Tiberi, A.
(2007):
Extracting semantic relations from query logs.
In: KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining,
New York, NY, USA.
[Volltext]
[BibTeX][Endnote]
@inproceedings{1281204,
  author    = {Baeza-Yates, Ricardo and Tiberi, Alessandro},
  title     = {Extracting semantic relations from query logs},
  booktitle = {KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining},
  publisher = {ACM},
  address   = {New York, NY, USA},
  year      = {2007},
  pages     = {76--85},
  url       = {http://portal.acm.org/citation.cfm?id=1281204},
  doi       = {10.1145/1281192.1281204},
  isbn      = {978-1-59593-609-7},
  keywords  = {log, query, search, semantic, similarity},
}
%0 = inproceedings
%A = Baeza-Yates, Ricardo and Tiberi, Alessandro
%B = KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
%C = New York, NY, USA
%D = 2007
%I = ACM
%T = Extracting semantic relations from query logs
%U = http://portal.acm.org/citation.cfm?id=1281204
|
P |
Bollegala, D.; Matsuo, Y. & Ishizuka, M.
(2007):
Measuring semantic similarity between words using web search engines.
In: WWW '07: Proceedings of the 16th international conference on World Wide Web,
New York, NY, USA.
[BibTeX][Endnote]
@inproceedings{Bollegala07semanticSearch,
  author    = {Bollegala, Danushka and Matsuo, Yutaka and Ishizuka, Mitsuru},
  title     = {Measuring semantic similarity between words using web search engines},
  booktitle = {WWW '07: Proceedings of the 16th international conference on World Wide Web},
  publisher = {ACM},
  address   = {New York, NY, USA},
  year      = {2007},
  pages     = {757--766},
  doi       = {10.1145/1242572.1242675},
  isbn      = {978-1-59593-654-7},
  keywords  = {search, semantic, similarity, toread},
}
%0 = inproceedings
%A = Bollegala, Danushka and Matsuo, Yutaka and Ishizuka, Mitsuru
%B = WWW '07: Proceedings of the 16th international conference on World Wide Web
%C = New York, NY, USA
%D = 2007
%I = ACM
%T = Measuring semantic similarity between words using web search engines
|
P |
Zhao, Q.; Hoi, S. C. H.; Liu, T.-Y.; Bhowmick, S. S.; Lyu, M. R. & Ma, W.-Y.
(2006):
Time-dependent semantic similarity measure of queries using historical click-through data.
In: WWW '06: Proceedings of the 15th international conference on World Wide Web,
New York, NY, USA.
[Volltext]
[Kurzfassung] [BibTeX][Endnote]
It has become a promising direction to measure similarity of Web search queries by mining the increasing amount of click-through data logged by Web search engines, which record the interactions between users and the search engines. Most existing approaches employ the click-through data for similarity measure of queries with little consideration of the temporal factor, while the click-through data is often dynamic and contains rich temporal information. In this paper we present a new framework of time-dependent query semantic similarity model on exploiting the temporal characteristics of historical click-through data. The intuition is that more accurate semantic similarity values between queries can be obtained by taking into account the timestamps of the log data. With a set of user-defined calendar schema and calendar patterns, our time-dependent query similarity model is constructed using the marginalized kernel technique, which can exploit both explicit similarity and implicit semantics from the click-through data effectively. Experimental results on a large set of click-through data acquired from a commercial search engine show that our time-dependent query similarity model is more accurate than the existing approaches. Moreover, we observe that our time-dependent query similarity model can, to some extent, reflect real-world semantics such as real-world events that are happening over time.
@inproceedings{1135858,
  author    = {Zhao, Qiankun and Hoi, Steven C. H. and Liu, Tie-Yan and Bhowmick, Sourav S. and Lyu, Michael R. and Ma, Wei-Ying},
  title     = {Time-dependent semantic similarity measure of queries using historical click-through data},
  booktitle = {WWW '06: Proceedings of the 15th international conference on World Wide Web},
  publisher = {ACM Press},
  address   = {New York, NY, USA},
  year      = {2006},
  pages     = {543--552},
  url       = {http://portal.acm.org/citation.cfm?id=1135858},
  doi       = {10.1145/1135777.1135858},
  isbn      = {1-59593-323-9},
  keywords  = {log, query, search, semantic, similarity},
  abstract  = {It has become a promising direction to measure similarity of Web search queries by mining the increasing amount of click-through data logged by Web search engines, which record the interactions between users and the search engines. Most existing approaches employ the click-through data for similarity measure of queries with little consideration of the temporal factor, while the click-through data is often dynamic and contains rich temporal information. In this paper we present a new framework of time-dependent query semantic similarity model on exploiting the temporal characteristics of historical click-through data. The intuition is that more accurate semantic similarity values between queries can be obtained by taking into account the timestamps of the log data. With a set of user-defined calendar schema and calendar patterns, our time-dependent query similarity model is constructed using the marginalized kernel technique, which can exploit both explicit similarity and implicit semantics from the click-through data effectively. Experimental results on a large set of click-through data acquired from a commercial search engine show that our time-dependent query similarity model is more accurate than the existing approaches. Moreover, we observe that our time-dependent query similarity model can, to some extent, reflect real-world semantics such as real-world events that are happening over time.},
}
%0 = inproceedings
%A = Zhao, Qiankun and Hoi, Steven C. H. and Liu, Tie-Yan and Bhowmick, Sourav S. and Lyu, Michael R. and Ma, Wei-Ying
%B = WWW '06: Proceedings of the 15th international conference on World Wide Web
%C = New York, NY, USA
%D = 2006
%I = ACM Press
%T = Time-dependent semantic similarity measure of queries using historical click-through data
%U = http://portal.acm.org/citation.cfm?id=1135858
|
P |
Ziegler, P.; Kiefer, C.; Sturm, C.; Dittrich, K. R. & Bernstein, A.
(2006):
Detecting Similarities in Ontologies with the SOQA-SimPack Toolkit.
In: 10th International Conference on Extending Database Technology (EDBT 2006),
Munich, Germany, March 26-31.
[BibTeX][Endnote]
@inproceedings{ziegler2006detecting,
  author    = {Ziegler, Patrick and Kiefer, Christoph and Sturm, Christoph and Dittrich, Klaus R. and Bernstein, Abraham},
  title     = {Detecting Similarities in Ontologies with the {SOQA-SimPack} Toolkit},
  editor    = {Ioannidis, Yannis and Scholl, Marc H. and Schmidt, Joachim W. and Matthes, Florian and Hatzopoulos, Mike and Boehm, Klemens and Kemper, Alfons and Grust, Torsten and Boehm, Christian},
  booktitle = {10th International Conference on Extending Database Technology (EDBT 2006)},
  series    = {Lecture Notes in Computer Science},
  volume    = {3896},
  publisher = {Springer},
  address   = {Munich, Germany},
  month     = mar,
  year      = {2006},
  pages     = {59--76},
  keywords  = {ontology, similarity, semantic},
}
%0 = inproceedings
%A = Ziegler, Patrick and Kiefer, Christoph and Sturm, Christoph and Dittrich, Klaus R. and Bernstein, Abraham
%B = 10th International Conference on Extending Database Technology (EDBT 2006)
%C = Munich, Germany, March 26-31
%D = 2006
%I = Springer
%T = Detecting Similarities in Ontologies with the SOQA-SimPack Toolkit
|
P |
Curran, J. R.: From Distributional to Semantic Similarity. Institute for Communicating and Collaborative Systems School of Informatics University of Edinburgh, 2003
[Volltext] [Kurzfassung] [BibTeX] [Endnote] Lexical-semantic resources, including thesauri and WordNet, have been successfully incorporated into a wide range of applications in Natural Language Processing. However they are very difficult and expensive to create and maintain, and their usefulness has been severely hampered by their limited coverage, bias and inconsistency. Automated and semi-automated methods for developing such resources are therefore crucial for further resource development and improved application performance. Systems that extract thesauri often identify similar words using the distributional hypothesis that similar words appear in similar contexts. This approach involves using corpora to examine the contexts each word appears in and then calculating the similarity between context distributions. Different definitions of context can be used, and I begin by examining how different types of extracted context influence similarity. To be of most benefit these systems must be capable of finding synonyms for rare words. Reliable context counts for rare events can only be extracted from vast collections of text. In this dissertation I describe how to extract contexts from a corpus of over 2 billion words. I describe techniques for processing text on this scale and examine the trade-off between context accuracy, information content and quantity of text analysed. Distributional similarity is at best an approximation to semantic similarity. I develop improved approximations motivated by the intuition that some events in the context distribution are more indicative of meaning than others. For instance, the object-of-verb context wear is far more indicative of a clothing noun than get. However, existing distributional techniques do not effectively utilise this information. The new context-weighted similarity metric I propose in this dissertation significantly outperforms every distributional similarity metric described in the literature. Nearest-neighbour similarity algorithms scale poorly with vocabulary and context vector size. To overcome this problem I introduce a new context-weighted approximation algorithm with bounded complexity in context vector size that significantly reduces the system runtime with only a minor performance penalty. I also describe a parallelized version of the system that runs on a Beowulf cluster for the 2 billion word experiments. To evaluate the context-weighted similarity measure I compare ranked similarity lists against gold-standard resources using precision and recall-based measures from Information Retrieval, since the alternative, application-based evaluation, can often be influenced by distributional as well as semantic similarity. I also perform a detailed analysis of the final results using WordNet. Finally, I apply my similarity metric to the task of assigning words to WordNet semantic categories. I demonstrate that this new approach outperforms existing methods and overcomes some of their weaknesses.
@phdthesis{Curran:2003,
  author   = {Curran, James Richard},
  title    = {From Distributional to Semantic Similarity},
  school   = {Institute for Communicating and Collaborative Systems, School of Informatics, University of Edinburgh},
  year     = {2003},
  url      = {http://www.era.lib.ed.ac.uk/bitstream/1842/563/2/IP030023.pdf},
  keywords = {distributional, semantic, similarity, toread, wordnet},
  abstract = {Lexical-semantic resources, including thesauri and WordNet, have been successfully incorporated into a wide range of applications in Natural Language Processing. However they are very difficult and expensive to create and maintain, and their usefulness has been severely hampered by their limited coverage, bias and inconsistency. Automated and semi-automated methods for developing such resources are therefore crucial for further resource development and improved application performance. Systems that extract thesauri often identify similar words using the distributional hypothesis that similar words appear in similar contexts. This approach involves using corpora to examine the contexts each word appears in and then calculating the similarity between context distributions. Different definitions of context can be used, and I begin by examining how different types of extracted context influence similarity. To be of most benefit these systems must be capable of finding synonyms for rare words. Reliable context counts for rare events can only be extracted from vast collections of text. In this dissertation I describe how to extract contexts from a corpus of over 2 billion words. I describe techniques for processing text on this scale and examine the trade-off between context accuracy, information content and quantity of text analysed. Distributional similarity is at best an approximation to semantic similarity. I develop improved approximations motivated by the intuition that some events in the context distribution are more indicative of meaning than others. For instance, the object-of-verb context wear is far more indicative of a clothing noun than get. However, existing distributional techniques do not effectively utilise this information. The new context-weighted similarity metric I propose in this dissertation significantly outperforms every distributional similarity metric described in the literature. Nearest-neighbour similarity algorithms scale poorly with vocabulary and context vector size. To overcome this problem I introduce a new context-weighted approximation algorithm with bounded complexity in context vector size that significantly reduces the system runtime with only a minor performance penalty. I also describe a parallelized version of the system that runs on a Beowulf cluster for the 2 billion word experiments. To evaluate the context-weighted similarity measure I compare ranked similarity lists against gold-standard resources using precision and recall-based measures from Information Retrieval, since the alternative, application-based evaluation, can often be influenced by distributional as well as semantic similarity. I also perform a detailed analysis of the final results using WordNet. Finally, I apply my similarity metric to the task of assigning words to WordNet semantic categories. I demonstrate that this new approach outperforms existing methods and overcomes some of their weaknesses.},
}
%0 = phdthesis
%A = Curran, James Richard
%D = 2003
%T = From Distributional to Semantic Similarity
%U = http://www.era.lib.ed.ac.uk/bitstream/1842/563/2/IP030023.pdf
|
J |
Green, S.
(1999):
Building Hypertext Links By Computing Semantic Similarity.
In: IEEE Transactions on Knowledge and Data Engineering,
Vol. 11,
Erscheinungsjahr/Year: 1999.
Seiten/Pages: 713-730.
[BibTeX]
[Endnote]
@article{green99hypertext,
  author   = {Green, S. J.},
  title    = {Building Hypertext Links By Computing Semantic Similarity},
  journal  = {IEEE Transactions on Knowledge and Data Engineering},
  year     = {1999},
  volume   = {11},
  pages    = {713--730},
  keywords = {clustering, semantic, similarity},
}
%0 = article
%A = Green, S.J.
%D = 1999
%T = Building Hypertext Links By Computing Semantic Similarity
|
P |
Lee, L.
(1999):
Measures of distributional similarity.
In: Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics,
Morristown, NJ, USA.
[Volltext]
[BibTeX][Endnote]
@inproceedings{1034693,
  author    = {Lee, Lillian},
  title     = {Measures of distributional similarity},
  booktitle = {Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics},
  publisher = {Association for Computational Linguistics},
  address   = {Morristown, NJ, USA},
  year      = {1999},
  pages     = {25--32},
  url       = {http://portal.acm.org/citation.cfm?id=1034693&dl=},
  doi       = {10.3115/1034678.1034693},
  isbn      = {1-55860-609-3},
  keywords  = {measure, similarity, toread},
}
%0 = inproceedings
%A = Lee, Lillian
%B = Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
%C = Morristown, NJ, USA
%D = 1999
%I = Association for Computational Linguistics
%T = Measures of distributional similarity
%U = http://portal.acm.org/citation.cfm?id=1034693&dl=
|
J |
Broder, A. Z.; Glassman, S. C.; Manasse, M. S. & Zweig, G.
(1997):
Syntactic clustering of the Web.
In: Computer Networks and ISDN Systems,
Ausgabe/Number: 8-13,
Vol. 29,
Erscheinungsjahr/Year: 1997.
Seiten/Pages: 1157-1166.
[Volltext] [Kurzfassung] [BibTeX]
[Endnote]
We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.
@article{keyhere,
  author   = {Broder, Andrei Z. and Glassman, Steven C. and Manasse, Mark S. and Zweig, Geoffrey},
  title    = {Syntactic clustering of the Web},
  journal  = {Computer Networks and ISDN Systems},
  note     = {Papers from the Sixth International World Wide Web Conference},
  year     = {1997},
  volume   = {29},
  number   = {8-13},
  pages    = {1157--1166},
  url      = {http://www.sciencedirect.com/science/article/B6TYT-3SP60S4-11/2/38f44c816ec8d69b406317de1629e56d},
  keywords = {Duplication, Fingerprints, Resemblance, Signatures, Similarity, Web, search},
  abstract = {We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.},
}
%0 = article
%A = Broder, Andrei Z. and Glassman, Steven C. and Manasse, Mark S. and Zweig, Geoffrey
%B = Papers from the Sixth International World Wide Web Conference
%D = 1997
%T = Syntactic clustering of the Web
%U = http://www.sciencedirect.com/science/article/B6TYT-3SP60S4-11/2/38f44c816ec8d69b406317de1629e56d
|
J |
Jiang, J. J. & Conrath, D. W.
(1997):
Semantic similarity based on corpus statistics and lexical taxonomy.
In: CoRR,
Vol. cmp-lg/9709008,
Erscheinungsjahr/Year: 1997.
[BibTeX]
[Endnote]
@article{jiang97semantic,
  author   = {Jiang, Jay J. and Conrath, David W.},
  title    = {Semantic similarity based on corpus statistics and lexical taxonomy},
  journal  = {CoRR},
  volume   = {cmp-lg/9709008},
  year     = {1997},
  keywords = {corpus, semantic, similarity, wordnet},
}
%0 = article
%A = Jiang, Jay J. and Conrath, David W.
%D = 1997
%T = Semantic similarity based on corpus statistics and lexical taxonomy
|
J |
Peat, H. J. & Willett, P.
(1991):
The limitations of term co-occurrence data for query expansion in document retrieval systems.
In: Journal of the American Society for Information Science,
Ausgabe/Number: 5,
Vol. 42,
Verlag/Publisher: John Wiley & Sons, Inc.
Erscheinungsjahr/Year: 1991.
Seiten/Pages: 378-383.
[Volltext] [BibTeX]
[Endnote]
@article{wiley1991,
  author    = {Peat, Helen J. and Willett, Peter},
  title     = {The limitations of term co-occurrence data for query expansion in document retrieval systems},
  journal   = {Journal of the American Society for Information Science},
  publisher = {John Wiley \& Sons, Inc.},
  address   = {Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom},
  year      = {1991},
  volume    = {42},
  number    = {5},
  pages     = {378--383},
  url       = {http://www.iro.umontreal.ca/~nie/IFT6255/Peat_Willett_QExp.pdf},
  doi       = {10.1002/(SICI)1097-4571(199106)42:5<378::AID-ASI8>3.0.CO;2-8},
  keywords  = {expansion, ir, query, similarity, term},
}
%0 = article
%A = Peat, Helen J. and Willett, Peter
%C = Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
%D = 1991
%I = Copyright © 1991 John Wiley & Sons, Inc.
%T = The limitations of term co-occurrence data for query expansion in document retrieval systems
%U = http://www.iro.umontreal.ca/~nie/IFT6255/Peat_Willett_QExp.pdf
|