PUMA publications for /tag/detection%20duplicate https://puma.uni-kassel.de/tag/detection%20duplicate PUMA RSS feed for /tag/detection%20duplicate 2024-03-29T01:31:33+01:00 Learning object identification rules for information integration https://puma.uni-kassel.de/bibtex/25ad46801d602408ce271276f452263a9/jaeschke jaeschke 2012-10-01T15:07:30+02:00 detection duplicate entity extraction identification information integration <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Sheila Tejada" itemprop="url" href="/author/Sheila%20Tejada"><span itemprop="name">S. Tejada</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Craig A Knoblock" itemprop="url" href="/author/Craig%20A%20Knoblock"><span itemprop="name">C. Knoblock</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Steven Minton" itemprop="url" href="/author/Steven%20Minton"><span itemprop="name">S. Minton</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Information Systems</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">26 </span></span>(<span itemprop="issueNumber">8</span>):
<span itemprop="pagination">607--633</span></em> </span>(<em><span>December 2001<meta content="Dezember 2001" itemprop="datePublished"/></span></em>) Mon Oct 01 15:07:30 CEST 2012 Information Systems dec 8 607--633 Learning object identification rules for information integration 26 2001 detection duplicate entity extraction identification information integration When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects’ shared attributes in order to identify matching objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous methods of object identification have required manual construction of object identification rules or mapping rules for determining the mappings between objects. This manual process is time-consuming and error-prone. In our approach, Active Atlas learns to tailor mapping rules, through limited user input, to a specific application domain. The experimental results demonstrate that we achieve higher accuracy and require less user involvement than previous methods across various application domains. A Classification-based Approach for Bibliographic Metadata Deduplication https://puma.uni-kassel.de/bibtex/28f87206e413c2c632b5c633f484fcbe2/jaeschke jaeschke 2012-03-05T15:51:37+01:00 bibliographic classification detection duplicate metadata <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Eduardo N. Borges" itemprop="url" href="/author/Eduardo%20N.%20Borges"><span itemprop="name">E. Borges</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Karin Becker" itemprop="url" href="/author/Karin%20Becker"><span itemprop="name">K. 
Becker</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Carlos A. Heuser" itemprop="url" href="/author/Carlos%20A.%20Heuser"><span itemprop="name">C. Heuser</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Renata Galante" itemprop="url" href="/author/Renata%20Galante"><span itemprop="name">R. Galante</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Proceedings of the IADIS International Conference WWW/Internet 2011</em></span></span> </span>(<em><span>2011<meta content="2011" itemprop="datePublished"/></span></em>) Mon Mar 05 15:51:37 CET 2012 Proceedings of the IADIS International Conference WWW/Internet 2011 221--228 A Classification-based Approach for Bibliographic Metadata Deduplication 2011 bibliographic classification detection duplicate metadata Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. These references can be represented by several formats and styles. Considerable content variations can occur in some metadata fields such as title, author names and publication venue. Besides, it is quite common to find references that omit some metadata fields such as page numbers. Duplicate entries influence the quality of digital library services, since they need to be appropriately identified and treated. This paper presents a comparative analysis among different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in a previous work. 
Our experiments show that the combination of specific-purpose similarity functions previously proposed and classification algorithms represents an improvement of up to 12% when compared to the experiments using our original approach. A Classification-based Approach for Bibliographic Metadata Deduplication https://puma.uni-kassel.de/bibtex/28f87206e413c2c632b5c633f484fcbe2/hotho hotho 2012-01-12T10:43:17+01:00 bibliographic detection duplicate puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Eduardo N. Borges" itemprop="url" href="/author/Eduardo%20N.%20Borges"><span itemprop="name">E. Borges</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Karin Becker" itemprop="url" href="/author/Karin%20Becker"><span itemprop="name">K. Becker</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Carlos A. Heuser" itemprop="url" href="/author/Carlos%20A.%20Heuser"><span itemprop="name">C. Heuser</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Renata Galante" itemprop="url" href="/author/Renata%20Galante"><span itemprop="name">R. Galante</span></a></span>. 
</span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Proceedings of the IADIS International Conference WWW/Internet 2011</em></span></span> </span>(<em><span>2011<meta content="2011" itemprop="datePublished"/></span></em>) Thu Jan 12 10:43:17 CET 2012 Proceedings of the IADIS International Conference WWW/Internet 2011 221-228 A Classification-based Approach for Bibliographic Metadata Deduplication 2011 bibliographic detection duplicate puma Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. These references can be represented by several formats and styles. Considerable content variations can occur in some metadata fields such as title, author names and publication venue. Besides, it is quite common to find references that omit some metadata fields such as page numbers. Duplicate entries influence the quality of digital library services, since they need to be appropriately identified and treated. This paper presents a comparative analysis among different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in a previous work. Our experiments show that the combination of specific-purpose similarity functions previously proposed and classification algorithms represents an improvement of up to 12% when compared to the experiments using our original approach. 
Near-duplicate detection by instance-level constrained clustering. https://puma.uni-kassel.de/bibtex/227e76ac1174db2a3ee4a3efd34bb2e16/hotho hotho 2012-01-11T15:15:29+01:00 bibliographic detection duplicate puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hui Yang" itemprop="url" href="/author/Hui%20Yang"><span itemprop="name">H. Yang</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="James P. Callan" itemprop="url" href="/author/James%20P.%20Callan"><span itemprop="name">J. Callan</span></a></span>. </span><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">SIGIR</span>, </em></span><em>pages <span itemprop="pagination">421-428</span>. </em><em><span itemprop="publisher">ACM</span>, </em>(<em><span>2006<meta content="2006" itemprop="datePublished"/></span></em>) Wed Jan 11 15:15:29 CET 2012 SIGIR conf/sigir/2006 421-428 Near-duplicate detection by instance-level constrained clustering. 2006 bibliographic detection duplicate puma dblp Duplicate Detection and Record Consolidation in Large Bibliographic Databases: The COPAC Database Experience. https://puma.uni-kassel.de/bibtex/2a1067917a86f9aaaa1d5610ae113436c/hotho hotho 2012-01-11T15:07:45+01:00 detection duplicate puma toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Shirley Anne Cousins" itemprop="url" href="/author/Shirley%20Anne%20Cousins"><span itemprop="name">S. Cousins</span></a></span>. 
</span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Journal of Information Science</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">24 </span></span>(<span itemprop="issueNumber">4</span>):
<span itemprop="pagination">231--40</span></em> </span>(<em><span>1998<meta content="1998" itemprop="datePublished"/></span></em>) Wed Jan 11 15:07:45 CET 2012 Journal of Information Science 4 231--40 Duplicate Detection and Record Consolidation in Large Bibliographic Databases: The COPAC Database Experience. 24 1998 detection duplicate puma toread COPAC is a union catalog giving access to the online catalog records of some of the largest academic research libraries in the United Kingdom and Ireland. Discussion includes ways in which duplicate detection and record consolidation procedures are carried out, along with problem areas encountered. (Author/AEF) Duplicate Detection and Record Consolidation in Large Bibliographic Databases: The COPAC Database Experience. Duplicate detection algorithms of bibliographic descriptions https://puma.uni-kassel.de/bibtex/2633b89b5a6827d28513545282f9f8bc7/hotho hotho 2012-01-11T11:56:16+01:00 bibliographic detection duplicate puma <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Anestis Sitas" itemprop="url" href="/author/Anestis%20Sitas"><span itemprop="name">A. Sitas</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Sarantos Kapidakis" itemprop="url" href="/author/Sarantos%20Kapidakis"><span itemprop="name">S. Kapidakis</span></a></span>. </span>(<em><span>2008<meta content="2008" itemprop="datePublished"/></span></em>) Wed Jan 11 11:56:16 CET 2012 Library Hi Tech pp. 287-301 Duplicate detection algorithms of bibliographic descriptions Vol. 
26 Iss: 2 2008 bibliographic detection duplicate puma Emerald | Library Hi Tech | Duplicate detection algorithms of bibliographic descriptions CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks https://puma.uni-kassel.de/bibtex/2389dba4432b1340211ef6be8e3d45a1d/hotho hotho 2011-11-17T14:21:19+01:00 algorithm detection duplicate ml toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Anish Das Sarma" itemprop="url" href="/author/Anish%20Das%20Sarma"><span itemprop="name">A. Sarma</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Ankur Jain" itemprop="url" href="/author/Ankur%20Jain"><span itemprop="name">A. Jain</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Ashwin Machanavajjhala" itemprop="url" href="/author/Ashwin%20Machanavajjhala"><span itemprop="name">A. Machanavajjhala</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Philip Bohannon" itemprop="url" href="/author/Philip%20Bohannon"><span itemprop="name">P. Bohannon</span></a></span>. </span>(<em><span>2011<meta content="2011" itemprop="datePublished"/></span></em>)<em>cite arxiv:1111.3689.</em> Thu Nov 17 14:21:19 CET 2011 cite arxiv:1111.3689 CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks 2011 algorithm detection duplicate ml toread De-duplication, the identification of distinct records referring to the same real-world entity, is a well-known challenge in data integration. Since very large datasets prohibit the comparison of every pair of records, blocking has been identified as a technique of dividing the dataset for pairwise comparisons, thereby trading off recall of identified duplicates for efficiency. 
Traditional de-duplication tasks, while challenging, typically involved a fixed schema such as Census data or medical records. However, with the presence of large, diverse sets of structured data on the web and the need to organize it effectively on content portals, de-duplication systems need to scale in a new dimension to handle a large number of schemas, tasks and data sets, while handling ever larger problem sizes. In addition, when working in a map-reduce framework it is important that canopy formation be implemented as a hash function, making the canopy design problem more challenging. We present CBLOCK, a system that addresses these challenges. CBLOCK learns hash functions automatically from attribute domains and a labeled dataset consisting of duplicates. Subsequently, CBLOCK expresses blocking functions using a hierarchical tree structure composed of atomic hash functions. The application may guide the automated blocking process based on architectural constraints, such as by specifying a maximum size of each block (based on memory requirements), imposing disjointness of blocks (in a grid environment), or specifying a particular objective function trading off recall for efficiency. As a post-processing step to automatically generated blocks, CBLOCK rolls up smaller blocks to increase recall. We present experimental results on two large-scale de-duplication datasets at Yahoo!, consisting of over 140K movies and 40K restaurants respectively, and demonstrate the utility of CBLOCK. CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks Duplicate Record Detection for Database Cleansing https://puma.uni-kassel.de/bibtex/2d14c2c587d32c0c91184183298683c10/hotho hotho 2011-02-22T12:40:51+01:00 detection duplicate <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Mariam Rehman" itemprop="url" href="/author/Mariam%20Rehman"><span itemprop="name">M. 
Rehman</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Vatcharapon Esichaikul" itemprop="url" href="/author/Vatcharapon%20Esichaikul"><span itemprop="name">V. Esichaikul</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Machine Vision, International Conference on</em></span></span> </span>(<em><span>2009<meta content="2009" itemprop="datePublished"/></span></em>) Tue Feb 22 12:40:51 CET 2011 Los Alamitos, CA, USA Machine Vision, International Conference on 333-338 Duplicate Record Detection for Database Cleansing 0 2009 detection duplicate Duplicate Record Detection for Database Cleansing Mapping Bibliographic Records with Bibliographic Hash Keys https://puma.uni-kassel.de/bibtex/201f6fe57f46e4b92fe02869341efdd8d/jaeschke jaeschke 2009-06-05T15:44:48+02:00 2009 bibkey bibliographic bibtex detection duplicate hash key myown <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Jakob Voss" itemprop="url" href="/author/Jakob%20Voss"><span itemprop="name">J. Voss</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Andreas Hotho" itemprop="url" href="/author/Andreas%20Hotho"><span itemprop="name">A. Hotho</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Robert Jäschke" itemprop="url" href="/author/Robert%20J%c3%a4schke"><span itemprop="name">R. Jäschke</span></a></span>. 
</span><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Information: Droge, Ware oder Commons?</span>, </em></span><em>Hochschulverband Informationswissenschaft, </em><em><span itemprop="publisher">Verlag Werner Hülsbusch</span>, </em>(<em><span>2009<meta content="2009" itemprop="datePublished"/></span></em>) Fri Jun 05 15:44:48 CEST 2009 Information: Droge, Ware oder Commons? Proceedings of the ISI Mapping Bibliographic Records with Bibliographic Hash Keys 2009 2009 bibkey bibliographic bibtex detection duplicate hash key myown This poster presents a set of hash keys for bibliographic records called bibkeys. Unlike other methods of duplicate detection, bibkeys can be calculated directly from a set of basic metadata fields (title, authors/editors, year). It is shown how bibkeys are used to map similar bibliographic records in BibSonomy and among distributed library catalogs and other distributed databases. Collection statistics for fast duplicate document detection. https://puma.uni-kassel.de/bibtex/224249e2a7b8b809050f9083fc75d3c18/hotho hotho 2007-11-23T14:24:47+01:00 detection document duplicate toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Abdur Chowdhury" itemprop="url" href="/author/Abdur%20Chowdhury"><span itemprop="name">A. Chowdhury</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Ophir Frieder" itemprop="url" href="/author/Ophir%20Frieder"><span itemprop="name">O. Frieder</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="David A. Grossman" itemprop="url" href="/author/David%20A.%20Grossman"><span itemprop="name">D. Grossman</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="M. 
Catherine McCabe" itemprop="url" href="/author/M.%20Catherine%20McCabe"><span itemprop="name">M. McCabe</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>ACM Trans. Inf. Syst.</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">20 </span></span>(<span itemprop="issueNumber">2</span>):
<span itemprop="pagination">171-191</span></em> </span>(<em><span>2002<meta content="2002" itemprop="datePublished"/></span></em>) Fri Nov 23 14:24:47 CET 2007 ACM Trans. Inf. Syst. 2 171-191 Collection statistics for fast duplicate document detection. 20 2002 detection document duplicate toread dblp Syntactic Clustering of the Web. https://puma.uni-kassel.de/bibtex/2b88a36c088beef971845324c862599d0/hotho hotho 2007-03-07T17:56:08+01:00 detection duplicate toread <span class="authorEditorList"><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Andrei Z. Broder" itemprop="url" href="/author/Andrei%20Z.%20Broder"><span itemprop="name">A. Broder</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Steven C. Glassman" itemprop="url" href="/author/Steven%20C.%20Glassman"><span itemprop="name">S. Glassman</span></a></span>, <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Mark S. Manasse" itemprop="url" href="/author/Mark%20S.%20Manasse"><span itemprop="name">M. Manasse</span></a></span>, and <span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Geoffrey Zweig" itemprop="url" href="/author/Geoffrey%20Zweig"><span itemprop="name">G. Zweig</span></a></span>. </span><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><span itemtype="http://schema.org/Periodical" itemscope="itemscope" itemprop="isPartOf"><span itemprop="name"><em>Computer Networks</em></span></span> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">29 </span></span>(<span itemprop="issueNumber">8-13</span>):
<span itemprop="pagination">1157-1166</span></em> </span>(<em><span>1997<meta content="1997" itemprop="datePublished"/></span></em>) Wed Mar 07 17:56:08 CET 2007 Computer Networks 8-13 1157-1166 Syntactic Clustering of the Web. 29 1997 detection duplicate toread dblp