PUMA publications for /user/hotho/duplicatehttps://puma.uni-kassel.de/user/hotho/duplicatePUMA RSS feed for /user/hotho/duplicate2024-03-29T09:35:02+01:00A Classification-based Approach for Bibliographic Metadata Deduplicationhttps://puma.uni-kassel.de/bibtex/28f87206e413c2c632b5c633f484fcbe2/hothohotho2012-01-12T10:43:17+01:00bibliographic detection duplicate puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Eduardo N. Borges" itemprop="url" href="/author/Eduardo%20N.%20Borges"><span itemprop="name">E. Borges</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Karin Becker" itemprop="url" href="/author/Karin%20Becker"><span itemprop="name">K. Becker</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Carlos A. Heuser" itemprop="url" href="/author/Carlos%20A.%20Heuser"><span itemprop="name">C. Heuser</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Renata Galante" itemprop="url" href="/author/Renata%20Galante"><span itemprop="name">R. Galante</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Proceedings of the IADIS International Conference WWW/Internet 2011</em></span></span> </span>(<em><span>2011<meta content="2011" itemprop="datePublished"/></span></em>)Thu Jan 12 10:43:17 CET 2012Proceedings of the IADIS International Conference WWW/Internet 2011 221-228A Classification-based Approach for Bibliographic Metadata Deduplication2011bibliographic detection duplicate puma Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. 
These references can be represented in several formats and styles. Considerable content variations can occur in some metadata fields such as title, author names and publication venue. Besides, it is quite common to find references that omit some metadata fields such as page numbers. Duplicate entries influence the quality of digital library services, since they need to be appropriately identified and treated. This paper presents a comparative analysis among different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in previous work. Our experiments show that the combination of specific-purpose similarity functions previously proposed and classification algorithms represents an improvement of up to 12% when compared to the experiments using our original approach. An unsupervised heuristic-based approach for bibliographic metadata deduplicationhttps://puma.uni-kassel.de/bibtex/2e7bc9412f92dddbfd5eaf81648ac5849/hothohotho2012-01-11T16:20:34+01:00bibliographic duplicate puma toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Eduardo N Borges" itemprop="url" href="/author/Eduardo%20N%20Borges"><span itemprop="name">E. Borges</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Moisés G De Carvalho" itemprop="url" href="/author/Mois%c3%a9s%20G%20De%20Carvalho"><span itemprop="name">M. De Carvalho</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Renata Galante" itemprop="url" href="/author/Renata%20Galante"><span itemprop="name">R. 
Galante</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Marcos André Gonçalves" itemprop="url" href="/author/Marcos%20Andr%c3%a9%20Gon%c3%a7alves"><span itemprop="name">M. Gonçalves</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Alberto H F Laender" itemprop="url" href="/author/Alberto%20H%20F%20Laender"><span itemprop="name">A. Laender</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Information Processing & Management</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">47 </span></span>(<span itemprop="issueNumber">5</span>):
<span itemprop="pagination">706--718</span></em> </span>(<em><span>2011<meta content="2011" itemprop="datePublished"/></span></em>)Wed Jan 11 16:20:34 CET 2012Information Processing & Management5706--718An unsupervised heuristic-based approach for bibliographic metadata deduplication472011bibliographic duplicate puma toread Filtering Duplicate Publications in Bibliographic Databases.https://puma.uni-kassel.de/bibtex/2b8efa49fddc744c36debf5a31b8142b4/hothohotho2012-01-11T15:17:39+01:00bibliographic duplicate publications puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Xiaoyi Jiang" itemprop="url" href="/author/Xiaoyi%20Jiang"><span itemprop="name">X. Jiang</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Daniel Mojon" itemprop="url" href="/author/Daniel%20Mojon"><span itemprop="name">D. Mojon</span></a></span>. </span><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">NDDL</span>, </em></span><em>Seite <span itemprop="pagination">79-88</span>. </em><em><span itemprop="publisher">ICEIS Press</span>, </em>(<em><span>2001<meta content="2001" itemprop="datePublished"/></span></em>)Wed Jan 11 15:17:39 CET 2012NDDLconf/nddl/200179-88Filtering Duplicate Publications in Bibliographic Databases.2001bibliographic duplicate publications puma dblpNear-duplicate detection by instance-level constrained clustering.https://puma.uni-kassel.de/bibtex/227e76ac1174db2a3ee4a3efd34bb2e16/hothohotho2012-01-11T15:15:29+01:00bibliographic detection duplicate puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hui Yang" itemprop="url" href="/author/Hui%20Yang"><span itemprop="name">H. Yang</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="James P. 
Callan" itemprop="url" href="/author/James%20P.%20Callan"><span itemprop="name">J. Callan</span></a></span>. </span><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">SIGIR</span>, </em></span><em>Seite <span itemprop="pagination">421-428</span>. </em><em><span itemprop="publisher">ACM</span>, </em>(<em><span>2006<meta content="2006" itemprop="datePublished"/></span></em>)Wed Jan 11 15:15:29 CET 2012SIGIRconf/sigir/2006421-428Near-duplicate detection by instance-level constrained clustering.2006bibliographic detection duplicate puma dblpDuplicate record identification in bibliographic databases.https://puma.uni-kassel.de/bibtex/2fc0cb18a9ce7efd3659c07f5e3c01541/hothohotho2012-01-11T15:13:45+01:00doubletten duplicate puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Pankaj Goyal" itemprop="url" href="/author/Pankaj%20Goyal"><span itemprop="name">P. Goyal</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Inf. Syst.</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">12 </span></span>(<span itemprop="issueNumber">3</span>):
<span itemprop="pagination">239-242</span></em> </span>(<em><span>1987<meta content="1987" itemprop="datePublished"/></span></em>)Wed Jan 11 15:13:45 CET 2012Inf. Syst.3239-242Duplicate record identification in bibliographic databases.121987doubletten duplicate puma Duplicate Detection and Record Consolidation in Large Bibliographic Databases: The COPAC Database Experience.https://puma.uni-kassel.de/bibtex/2a1067917a86f9aaaa1d5610ae113436c/hothohotho2012-01-11T15:07:45+01:00detection duplicate puma toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Shirley Anne Cousins" itemprop="url" href="/author/Shirley%20Anne%20Cousins"><span itemprop="name">S. Cousins</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Journal of Information Science</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">24 </span></span>(<span itemprop="issueNumber">4</span>):
<span itemprop="pagination">231--40</span></em> </span>(<em><span>1998<meta content="1998" itemprop="datePublished"/></span></em>)Wed Jan 11 15:07:45 CET 2012Journal of Information Science4231--40Duplicate Detection and Record Consolidation in Large Bibliographic Databases: The COPAC Database Experience.241998detection duplicate puma toread COPAC is a union catalog giving access to the online catalog records of some of the largest academic research libraries in the United Kingdom and Ireland. Discussion includes ways in which duplicate detection and record consolidation procedures are carried out, along with problem areas encountered. (Author/AEF)Duplicate Detection and Record Consolidation in Large Bibliographic Databases: The COPAC Database Experience.Duplicate detection algorithms of bibliographic descriptionshttps://puma.uni-kassel.de/bibtex/2633b89b5a6827d28513545282f9f8bc7/hothohotho2012-01-11T11:56:16+01:00bibliographic detection duplicate puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Anestis Sitas" itemprop="url" href="/author/Anestis%20Sitas"><span itemprop="name">A. Sitas</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Sarantos Kapidakis" itemprop="url" href="/author/Sarantos%20Kapidakis"><span itemprop="name">S. Kapidakis</span></a></span>. </span>(<em><span>2008<meta content="2008" itemprop="datePublished"/></span></em>)Wed Jan 11 11:56:16 CET 2012Library Hi Techpp. 287-301Duplicate detection algorithms of bibliographic descriptionsVol. 
26 Iss: 22008bibliographic detection duplicate puma Emerald | Library Hi Tech | Duplicate detection algorithms of bibliographic descriptionsSignature Based Duplicate Detection in Digital Librarieshttps://puma.uni-kassel.de/bibtex/26369260b8ed58d9445b8d2df0a1864f4/hothohotho2012-01-11T11:44:26+01:00bibliographic doubletten duplicate puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Lam Padmasree" itemprop="url" href="/author/Lam%20Padmasree"><span itemprop="name">L. Padmasree</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Vamshi Ambati" itemprop="url" href="/author/Vamshi%20Ambati"><span itemprop="name">V. Ambati</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Jasthi Anand Chandulal" itemprop="url" href="/author/Jasthi%20Anand%20Chandulal"><span itemprop="name">J. Chandulal</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Meda Sreenivasa Rao" itemprop="url" href="/author/Meda%20Sreenivasa%20Rao"><span itemprop="name">M. Rao</span></a></span>. 
</span><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Proceedings of the 2nd ICUDL</span>, </em></span><em>Alexandria, </em>(<em><span>2006<meta content="2006" itemprop="datePublished"/></span></em>)Wed Jan 11 11:44:26 CET 2012AlexandriaProceedings of the 2nd ICUDLSignature Based Duplicate Detection in Digital Libraries2006bibliographic doubletten duplicate puma New Issues in Near-duplicate Detection.https://puma.uni-kassel.de/bibtex/22eeececfe9ce4c4956142231523df00a/hothohotho2012-01-11T11:42:27+01:00doubletten duplicate puma toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Martin Potthast" itemprop="url" href="/author/Martin%20Potthast"><span itemprop="name">M. Potthast</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Benno Stein" itemprop="url" href="/author/Benno%20Stein"><span itemprop="name">B. Stein</span></a></span>. </span><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">GfKl</span>, </em></span><em>Seite <span itemprop="pagination">601-609</span>. 
</em><em><span itemprop="publisher">Springer</span>, </em>(<em><span>2007<meta content="2007" itemprop="datePublished"/></span></em>)Wed Jan 11 11:42:27 CET 2012GfKlconf/gfkl/2007601-609Studies in Classification, Data Analysis, and Knowledge OrganizationNew Issues in Near-duplicate Detection.2007doubletten duplicate puma toread dblpCBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Taskshttps://puma.uni-kassel.de/bibtex/2389dba4432b1340211ef6be8e3d45a1d/hothohotho2011-11-17T14:21:19+01:00algorithm detection duplicate ml toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Anish Das Sarma" itemprop="url" href="/author/Anish%20Das%20Sarma"><span itemprop="name">A. Sarma</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Ankur Jain" itemprop="url" href="/author/Ankur%20Jain"><span itemprop="name">A. Jain</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Ashwin Machanavajjhala" itemprop="url" href="/author/Ashwin%20Machanavajjhala"><span itemprop="name">A. Machanavajjhala</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Philip Bohannon" itemprop="url" href="/author/Philip%20Bohannon"><span itemprop="name">P. Bohannon</span></a></span>. </span>(<em><span>2011<meta content="2011" itemprop="datePublished"/></span></em>)<em>cite arxiv:1111.3689.</em>Thu Nov 17 14:21:19 CET 2011cite arxiv:1111.3689CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks2011algorithm detection duplicate ml toread De-duplication, the identification of distinct records referring to the same real-world entity, is a well-known challenge in data integration. 
Since very large datasets prohibit the comparison of every pair of records, blocking has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off recall of identified duplicates for efficiency. Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems need to scale in a new dimension to handle a large number of schemas, tasks and data sets, while handling ever larger problem sizes. In addition, when working in a map-reduce framework it is important that canopy formation be implemented as a hash function, making the canopy design problem more challenging. We present CBLOCK, a system that addresses these challenges. CBLOCK learns hash functions automatically from attribute domains and a labeled dataset consisting of duplicates. Subsequently, CBLOCK expresses blocking functions using a hierarchical tree structure composed of atomic hash functions. The application may guide the automated blocking process based on architectural constraints, such as by specifying a maximum size of each block (based on memory requirements), imposing disjointness of blocks (in a grid environment), or specifying a particular objective function trading off recall for efficiency. As a post-processing step to automatically generated blocks, CBLOCK rolls up smaller blocks to increase recall. We present experimental results on two large-scale de-duplication datasets at Yahoo! (consisting of over 140K movies and 40K restaurants, respectively) and demonstrate the utility of CBLOCK. 
CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication TasksDuplicate Record Detection: A Surveyhttps://puma.uni-kassel.de/bibtex/2bfff8a370abdf14f7f882f87c1ff61e1/hothohotho2011-02-22T12:45:49+01:00duplicate linkage toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="A. K. Elmagarmid" itemprop="url" href="/author/A.%20K.%20Elmagarmid"><span itemprop="name">A. Elmagarmid</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="P. G. Ipeirotis" itemprop="url" href="/author/P.%20G.%20Ipeirotis"><span itemprop="name">P. Ipeirotis</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="V. S. Verykios" itemprop="url" href="/author/V.%20S.%20Verykios"><span itemprop="name">V. Verykios</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Knowledge and Data Engineering, IEEE Transactions on</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">19 </span></span>(<span itemprop="issueNumber">1</span>):
<span itemprop="pagination">1--16</span></em> </span>(<em><span>2007<meta content="2007" itemprop="datePublished"/></span></em>)Tue Feb 22 12:45:49 CET 2011Knowledge and Data Engineering, IEEE Transactions on11--16Duplicate Record Detection: A Survey192007duplicate linkage toread Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the areaDuplicate Record Detection for Database Cleansinghttps://puma.uni-kassel.de/bibtex/2d14c2c587d32c0c91184183298683c10/hothohotho2011-02-22T12:40:51+01:00detection duplicate <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Mariam Rehman" itemprop="url" href="/author/Mariam%20Rehman"><span itemprop="name">M. Rehman</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Vatcharapon Esichaikul" itemprop="url" href="/author/Vatcharapon%20Esichaikul"><span itemprop="name">V. Esichaikul</span></a></span>. 
</span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Machine Vision, International Conference on</em></span></span> </span>(<em><span>2009<meta content="2009" itemprop="datePublished"/></span></em>)Tue Feb 22 12:40:51 CET 2011Los Alamitos, CA, USAMachine Vision, International Conference on333-338Duplicate Record Detection for Database Cleansing02009detection duplicate Duplicate Record Detection for Database CleansingMapping Bibliographic Records with Bibliographic Hash Keyshttps://puma.uni-kassel.de/bibtex/201f6fe57f46e4b92fe02869341efdd8d/hothohotho2009-04-07T17:35:58+02:002009 duplicate fingerprint hash iteg mapping myown tagorapub <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Jakob Voss" itemprop="url" href="/author/Jakob%20Voss"><span itemprop="name">J. Voss</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Andreas Hotho" itemprop="url" href="/author/Andreas%20Hotho"><span itemprop="name">A. Hotho</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Robert Jäschke" itemprop="url" href="/author/Robert%20J%c3%a4schke"><span itemprop="name">R. Jäschke</span></a></span>. 
</span><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Information: Droge, Ware oder Commons?</span>, </em></span><em>Hochschulverband Informationswissenschaft, </em><em><span itemprop="publisher">Verlag Werner Hülsbusch</span>, </em>(<em><span>2009<meta content="2009" itemprop="datePublished"/></span></em>)Tue Apr 07 17:35:58 CEST 2009Information: Droge, Ware oder Commons?Proceedings of the ISIMapping Bibliographic Records with Bibliographic Hash Keys20092009 duplicate fingerprint hash iteg mapping myown tagorapub This poster presents a set of hash keys for bibliographic records called bibkeys. Unlike other methods of duplicate detection, bibkeys can directly be calculated from a set of basic metadata fields (title, authors/editors, year). It is shown how bibkeys are used to map similar bibliographic records in BibSonomy and among distributed library catalogs and other distributed databases.Collection statistics for fast duplicate document detection.https://puma.uni-kassel.de/bibtex/224249e2a7b8b809050f9083fc75d3c18/hothohotho2007-11-23T14:24:47+01:00detection document duplicate toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Abdur Chowdhury" itemprop="url" href="/author/Abdur%20Chowdhury"><span itemprop="name">A. Chowdhury</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Ophir Frieder" itemprop="url" href="/author/Ophir%20Frieder"><span itemprop="name">O. Frieder</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="David A. Grossman" itemprop="url" href="/author/David%20A.%20Grossman"><span itemprop="name">D. Grossman</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="M. 
Catherine McCabe" itemprop="url" href="/author/M.%20Catherine%20McCabe"><span itemprop="name">M. McCabe</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>ACM Trans. Inf. Syst.</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">20 </span></span>(<span itemprop="issueNumber">2</span>):
<span itemprop="pagination">171-191</span></em> </span>(<em><span>2002<meta content="2002" itemprop="datePublished"/></span></em>)Fri Nov 23 14:24:47 CET 2007ACM Trans. Inf. Syst.2171-191Collection statistics for fast duplicate document detection.202002detection document duplicate toread dblpSyntactic Clustering of the Web.https://puma.uni-kassel.de/bibtex/2b88a36c088beef971845324c862599d0/hothohotho2007-03-07T17:56:08+01:00detection duplicate toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Andrei Z. Broder" itemprop="url" href="/author/Andrei%20Z.%20Broder"><span itemprop="name">A. Broder</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Steven C. Glassman" itemprop="url" href="/author/Steven%20C.%20Glassman"><span itemprop="name">S. Glassman</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Mark S. Manasse" itemprop="url" href="/author/Mark%20S.%20Manasse"><span itemprop="name">M. Manasse</span></a></span>, und <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Geoffrey Zweig" itemprop="url" href="/author/Geoffrey%20Zweig"><span itemprop="name">G. Zweig</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Computer Networks</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">29 </span></span>(<span itemprop="issueNumber">8-13</span>):
<span itemprop="pagination">1157-1166</span></em> </span>(<em><span>1997<meta content="1997" itemprop="datePublished"/></span></em>)Wed Mar 07 17:56:08 CET 2007Computer Networks8-131157-1166Syntactic Clustering of the Web.291997detection duplicate toread dblp