The Google Similarity Distance
R. Cilibrasi and P. Vitanyi.
IEEE Transactions on Knowledge and Data Engineering (2007)

Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of `society' is `database,' and the equivalent of `use' is `way to search the database.' We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies, and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
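
The abstract describes the distance only as being derived from Google page counts. As a rough illustration, the sketch below computes the normalized Google distance in the form commonly given for this measure, NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) and f(y) are the page counts for each term, f(x, y) is the count of pages containing both, and N is the total number of indexed pages. The formula is not stated in the abstract itself; the function name, the argument names, and the example counts are assumptions made for this sketch, and no search engine is actually queried.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Sketch of the normalized Google distance from raw page counts.

    fx, fy : page counts for terms x and y (hypothetical inputs)
    fxy    : count of pages containing both x and y
    n      : total number of pages indexed (assumed larger than fx and fy)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term needs a positive page count")
    if n <= max(fx, fy):
        raise ValueError("n must exceed both page counts")
    if fxy <= 0:
        # The terms never co-occur, so log f(x, y) diverges.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


if __name__ == "__main__":
    # Made-up counts chosen for easy arithmetic, not real search results.
    print(normalized_google_distance(1000, 100, 10, 10**6))
```

With these made-up counts the result is approximately 0.5. Values near 0 indicate terms that almost always appear together, and larger values indicate terms that rarely co-occur. The ratio does not depend on the base of the logarithm.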

URL

http://www.citebase.org/abstract?id=oai:arXiv.org:cs/0412098

@article{cilibrasi2007google,
  abstract = {Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of `society' is `database,' and the equivalent of `use' is `way to search the database.' We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies, and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87\% with the expert-crafted WordNet categories.},
  added-at = {2011-02-04T16:10:01.000+0100},
  author = {Cilibrasi, Rudi and Vitanyi, Paul M. B.},
  biburl = {https://puma.uni-kassel.de/bibtex/200ba496f53767b92d5965db71eeea8bf/benz},
  description = {[cs/0412098] The Google Similarity Distance},
  interhash = {8fc73a93c327ea9a45ef793242ac3508},
  intrahash = {00ba496f53767b92d5965db71eeea8bf},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  keywords = {google relatedness_measures web_based imported},
  pages = 370,
  timestamp = {2011-02-04T16:10:01.000+0100},
  title = {The Google Similarity Distance},
  url = {http://www.citebase.org/abstract?id=oai:arXiv.org:cs/0412098},
  volume = 19,
  year = 2007
}

%0 Journal Article
%1 cilibrasi2007google
%A Cilibrasi, Rudi
%A Vitanyi, Paul M. B.
%D 2007
%J IEEE Transactions on Knowledge and Data Engineering
%K google relatedness_measures web_based imported
%P 370
%T The Google Similarity Distance
%U http://www.citebase.org/abstract?id=oai:arXiv.org:cs/0412098
%V 19
%X Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of `society' is `database,' and the equivalent of `use' is `way to search the database.' We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies, and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
