TY - CHAP
AU - Piskorski, Jakub
AU - Yangarber, Roman
A2 - Poibeau, Thierry
A2 - Saggion, Horacio
A2 - Piskorski, Jakub
A2 - Yangarber, Roman
T1 - Information Extraction: Past, Present and Future
T2 - Multi-source, Multilingual Information Extraction and Summarization
PB - Springer Berlin Heidelberg
PY - 2013
SP - 23
EP - 49
UR - http://dx.doi.org/10.1007/978-3-642-28569-1_2
M3 - 10.1007/978-3-642-28569-1_2
KW - extraction
KW - information
KW - sota
KW - survey
SN - 978-3-642-28568-4
N1 - Information Extraction: Past, Present and Future - Springer
AB - In this chapter we present a brief overview of Information Extraction, which is an area of natural language processing that deals with finding factual information in free text. In formal terms,
ER -

TY - GEN
A2 - Poibeau, Thierry
A2 - Saggion, Horacio
A2 - Piskorski, Jakub
A2 - Yangarber, Roman
T1 - Multi-source, multilingual information extraction and summarization
PB - Springer
AD - Berlin; New York
PY - 2013
UR - http://link.springer.com/book/10.1007/978-3-642-28569-1
KW - extraction
KW - information
KW - multi
KW - multilingual
KW - sota
KW - summarization
N1 - Multi-source, Multilingual Information Extraction and Summarization - Springer
AB - Information extraction (IE) and text summarization (TS) are powerful technologies for finding relevant pieces of information in text and presenting them to the user in condensed form. The ongoing information explosion makes IE and TS critical for successful functioning within the information society. These technologies face particular challenges due to the inherent multi-source nature of the information explosion. The technologies must now handle not isolated texts or individual narratives, but rather large-scale repositories and streams--in general, in multiple languages--containing a multiplicity of perspectives, opinions, or commentaries on particular topics, entities or events. There is thus a need to adapt existing techniques and develop new ones to deal with these challenges. This volume contains a selection of papers that present a variety of methodologies for content identification and extraction, as well as for content fusion and regeneration. The chapters cover various aspects of the challenges, depending on the nature of the information sought--names vs. events--and the nature of the sources--news streams vs. image captions vs. scientific research papers, etc. This volume aims to offer a broad and representative sample of studies from this very active research field.
ER -

TY - JOUR
AU - Balke, Wolf-Tilo
T1 - Introduction to Information Extraction: Basic Notions and Current Trends
JO - Datenbank-Spektrum
PY - 2012
VL - 12
IS - 2
SP - 81
EP - 88
UR - http://dx.doi.org/10.1007/s13222-012-0090-x
M3 - 10.1007/s13222-012-0090-x
KW - extraction
KW - ie
KW - information
KW - survey
AB - Transforming unstructured or semi-structured information into structured knowledge is one of the big challenges of today’s knowledge society. While this abstract goal is still unreached and probably unreachable, intelligent information extraction techniques are considered key ingredients on the way to generating and representing knowledge for a wide variety of applications. This is especially true for the current efforts to turn the World Wide Web, the world’s largest collection of information, into the world’s largest knowledge base. This introduction gives a broad overview of the major topics and current trends in information extraction.
ER -

TY - CONF
AU - Granitzer, Michael
AU - Hristakeva, Maya
AU - Knight, Robert
AU - Jack, Kris
AU - Kern, Roman
T1 - A comparison of layout based bibliographic metadata extraction techniques
T2 - Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
PB - ACM
CY - New York, NY, USA
PY - 2012
SP - 19:1
EP - 19:8
UR - http://doi.acm.org/10.1145/2254129.2254154
M3 - 10.1145/2254129.2254154
KW - extraction
KW - ie
KW - information
SN - 978-1-4503-0915-8
N1 - A comparison of layout based bibliographic metadata extraction techniques
AB - Social research networks such as Mendeley and CiteULike offer various services for collaboratively managing bibliographic metadata. Compared with traditional libraries, metadata quality is of crucial importance in order to create a crowdsourced bibliographic catalog for search and browsing. Artifacts, in particular PDFs, which are managed by the users of the social research networks, become one important metadata source and the starting point for creating a homogeneous, high quality, bibliographic catalog. Natural Language Processing and Information Extraction techniques have been employed to extract structured information from unstructured sources. However, given highly heterogeneous artifacts that cover a range of publication styles, stemming from different publication sources, and imperfect PDF processing tools, how accurate are metadata extraction methods in such real-world settings? This paper focuses on answering that question by investigating the use of Conditional Random Fields and Support Vector Machines on real-world data gathered from Mendeley and Linked-Data repositories. We compare style and content features on existing state-of-the-art methods on two newly created real-world data sets for metadata extraction. Our analysis shows that two-stage SVMs provide reasonable performance in solving the challenge of metadata extraction for crowdsourcing bibliographic metadata management.
ER -

TY - CONF
AU - Gunes, Omer
AU - Schallhart, Christian
AU - Furche, Tim
AU - Lehmann, Jens
AU - Ngomo, Axel-Cyrille Ngonga
T1 - EAGER: extending automatically gazetteers for entity recognition
T2 - Proceedings of the 3rd Workshop on the People's Web Meets NLP: Collaboratively Constructed Semantic Resources and their Applications to NLP
PY - 2012/07
SP - 29
EP - 33
UR - http://acl.eldoc.ub.rug.nl/mirror/W/W12/W12-4005.pdf
KW - dbpedia
KW - eager
KW - entity
KW - extraction
KW - gazetteer
KW - ie
KW - named
KW - ner
KW - recognition
KW - wikipedia
AB - Key to named entity recognition, the manual gazetteering of entity lists is a costly, error-prone process that often yields results that are incomplete and suffer from sampling bias. Exploiting current sources of structured information, we propose a novel method for extending minimal seed lists into complete gazetteers. Like previous approaches, we value Wikipedia as a huge, well-curated, and relatively unbiased source of entities. However, in contrast to previous work, we exploit not only its content, but also its structure, as exposed in DBpedia. We extend gazetteers through Wikipedia categories, carefully limiting the impact of noisy categorizations. The resulting gazetteers easily outperform previous approaches on named entity recognition.
ER -

TY - CONF
AU - Klügl, Peter
AU - Toepfer, Martin
AU - Lemmerich, Florian
AU - Hotho, Andreas
AU - Puppe, Frank
A2 - Flach, Peter A.
A2 - Bie, Tijl De
A2 - Cristianini, Nello
T1 - Collective Information Extraction with Context-Specific Consistencies
T2 - ECML/PKDD (1)
PB - Springer
PY - 2012
VL - 7523
SP - 728
EP - 743
UR - http://dblp.uni-trier.de/db/conf/pkdd/pkdd2012-1.html#KluglTLHP12
KW - 2012
KW - context
KW - extraction
KW - ie
KW - information
KW - myown
SN - 978-3-642-33459-7
ER -

TY - JOUR
AU - Song, Yi-Cheng
AU - Zhang, Yong-Dong
AU - Cao, Juan
AU - Xia, Tian
AU - Liu, Wu
AU - Li, Jin-Tao
T1 - Web Video Geolocation by Geotagged Social Resources
JO - IEEE Transactions on Multimedia
PY - 2012/04
VL - 14
IS - 2
SP - 456
EP - 470
UR - http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6054059
M3 - 10.1109/TMM.2011.2172937
KW - extraction
KW - geo
KW - location
KW - map
KW - video
AB - This paper considers the problem of web video geolocation: we hope to determine where on the Earth a web video was taken. By analyzing a 6.5-million geotagged web video dataset, we observe that there exist inherent geographic intimacies between a video and its relevant videos (related videos and same-author videos). This social relationship supplies a direct and effective cue to locate the video to a particular region on the earth. Based on this observation, we propose an effective web video geolocation algorithm that propagates geotags over the web video social relationship graph. For videos that have no geotagged relevant videos, we collect geotagged relevant images that are similar in content to the video (sharing some visual or textual information with it) as the cue to infer the location of the video. The experiments demonstrate the effectiveness of both methods, with geolocation accuracy much better than that of state-of-the-art approaches. Finally, an online web video geolocation system, Video2Location (V2L), is developed to provide public access to our algorithm.
ER -

TY - CONF
AU - Chrupala, Grzegorz
AU - Klakow, Dietrich
A2 - Calzolari, Nicoletta
A2 - Choukri, Khalid
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Odijk, Jan
A2 - Piperidis, Stelios
A2 - Rosner, Mike
A2 - Tapias, Daniel
T1 - A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters
T2 - LREC
PB - European Language Resources Association
PY - 2010
UR - http://lexitron.nectec.or.th/public/LREC-2010_Malta/pdf/538_Paper.pdf
KW - entity
KW - extraction
KW - german
KW - ie
KW - information
KW - named
KW - ner
KW - recognition
KW - wikipedia
SN - 2-9517408-6-7
ER -

TY - JOUR
AU - Raykar, Vikas C.
AU - Yu, Shipeng
AU - Zhao, Linda H.
AU - Valadez, Gerardo Hermosillo
AU - Florin, Charles
AU - Bogoni, Luca
AU - Moy, Linda
T1 - Learning From Crowds
JO - Journal of Machine Learning Research
PY - 2010/08
VL - 11
SP - 1297
EP - 1322
UR - http://dl.acm.org/citation.cfm?id=1756006.1859894
KW - cirg
KW - collective
KW - computing
KW - crowdsourcing
KW - extraction
KW - human
KW - ie
KW - information
KW - intelligence
KW - learning
KW - machine
KW - ml
KW - social
AB - For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, there is a substantial amount of disagreement among the annotators, and hence it is of great practical interest to address conventional supervised learning problems in this scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline.
ER -

TY - CONF
AU - Fink, C.
AU - Piatko, C.
AU - Mayfield, J.
AU - Chou, D.
AU - Finin, T.
AU - Martineau, J.
T1 - The Geolocation of Web Logs from Textual Clues
T2 - Proceedings of the International Conference on Computational Science and Engineering
PY - 2009/08
VL - 4
SP - 1088
EP - 1092
UR - http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5282996
M3 - 10.1109/CSE.2009.584
KW - blog
KW - extraction
KW - geo
KW - location
KW - map
KW - toponym
KW - web
AB - Understanding the spatial distribution of people who author social media content is of growing interest for researchers and commerce. Blogging platforms depend on authors reporting their own location. However, not all authors report or reveal their location on their blog's home page. Automated geolocation strategies using IP address and domain name are not adequate for determining an author's location because most blogs are not self-hosted. In this paper we describe a method that uses the place name mentions in a blog to determine an author's location. We achieved an accuracy of 63% on a collection of 844 blogs with known locations.
ER -

TY - JOUR
AU - Ley, Michael
T1 - DBLP: some lessons learned
JO - Proceedings of the VLDB Endowment
PY - 2009/08
VL - 2
IS - 2
SP - 1493
EP - 1500
UR - http://dl.acm.org/citation.cfm?id=1687553.1687577
KW - analysis
KW - author
KW - bibliography
KW - citation
KW - computer
KW - dblp
KW - entity
KW - extraction
KW - identification
KW - ie
KW - information
KW - named
KW - resolution
KW - science
AB - The DBLP Computer Science Bibliography evolved from an early small experimental Web server to a popular service for the computer science community. Many design decisions and details of the public XML records behind DBLP were never documented. This paper is a review of the evolution of DBLP. The main perspective is data modeling. In DBLP persons play a central role; our discussion of person names may be applicable to many other databases. All DBLP data are available for your own experiments. You may either download the complete set, or use a simple XML-based API described in an online appendix.
ER -

TY - CONF
AU - Martins, B.
AU - Manguinhas, H.
AU - Borbinha, J.
T1 - Extracting and Exploring the Geo-Temporal Semantics of Textual Resources
T2 - Proceedings of the International Conference on Semantic Computing
PB - IEEE Computer Society
PY - 2008/08
SP - 1
EP - 9
UR - http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4597167
M3 - 10.1109/ICSC.2008.86
KW - data
KW - extraction
KW - geo
KW - map
KW - semantic
KW - text
AB - Geo-temporal criteria are important for filtering, grouping and prioritizing information resources. This paper presents techniques for extracting semantic geo-temporal information from text, using simple text mining methods that leverage a gazetteer. A prototype system, implementing the proposed methods and capable of displaying information over maps and timelines, is described. This prototype can take RSS input, demonstrating the application to content from many different online sources. Experimental results demonstrate the efficiency and accuracy of the proposed approaches.
ER -

TY - THES
AU - Leidner, Jochen Lothar
T1 - Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding of Place Names
PY - 2007
PB - School of Informatics, University of Edinburgh
UR - http://www.era.lib.ed.ac.uk/bitstream/1842/1849/1/leidner-2007-phd.pdf
KW - extraction
KW - geo
KW - map
KW - name
KW - place
KW - resolution
KW - toponym
AB - Background. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g., using latitude and longitude). However, present-day GIS systems provide no automatic geo-coding functionality for unstructured text. In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)). However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g. for choosing a focus) and question answering (e.g., for questions like How far is London from Edinburgh?, given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been neglected. Concentrating on geographic names for populated places, I define the task of automatic Toponym Resolution (TR) as computing the mapping from occurrences of names for places as found in a text to a representation of the extensional semantics of the location referred to (its referent), such as a geographic latitude/longitude footprint. The task of mapping from names to locations is hard due to insufficient and noisy databases, and a large degree of ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK or to London, Ontario, Canada, or to about forty other Londons on earth). In addition, names of places and the boundaries referred to change over time, and databases are incomplete. Objective. I investigate how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model robustly on open-domain news text. I begin by comparing the few algorithms proposed in the literature, and, comparing semiformal, reconstructed descriptions of them, I factor out a shared repertoire of linguistic heuristics (e.g. rules, patterns) and extra-linguistic knowledge sources (e.g. population sizes). I then investigate how to combine these sources of evidence to obtain a superior method. I also investigate the noise effect introduced by the named entity tagging step that toponym resolution relies on in a sequential system pipeline architecture. Scope. In this thesis, I investigate a present-day snapshot of terrestrial geography as represented in the gazetteer defined and, accordingly, a collection of present-day news text. I limit the investigation to populated places; geo-coding of artifact names (e.g. airports or bridges) and compositional geographic descriptions (e.g. 40 miles SW of London, near Berlin), for instance, is not attempted. Historic change is a major factor affecting gazetteer construction and ultimately toponym resolution. However, this is beyond the scope of this thesis. Method. While a small number of previous attempts have been made to solve the toponym resolution problem, these were either not evaluated, or evaluation was done by manual inspection of system output instead of curating a reusable reference corpus. Since the relevant literature is scattered across several disciplines (GIS, digital libraries, information retrieval, natural language processing) and descriptions of algorithms are mostly given in informal prose, I attempt to systematically describe them and aim at a reconstruction in a uniform, semi-formal pseudo-code notation for easier re-implementation. A systematic comparison leads to an inventory of heuristics and other sources of evidence. In order to carry out a comparative evaluation procedure, an evaluation resource is required. Unfortunately, to date no gold standard has been curated in the research community. To this end, a reference gazetteer and an associated novel reference corpus with human-labeled referent annotation are created. These are subsequently used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory. I then compare the performance of the same TR algorithms under three different conditions, namely applying it to (i) the output of human named entity annotation, (ii) automatic annotation using an existing Maximum Entropy sequence tagging model, and (iii) a naïve toponym lookup procedure in a gazetteer. Evaluation. The algorithms implemented in this thesis are evaluated in an intrinsic or component evaluation. To this end, we define a task-specific matching criterion to be used with traditional Precision (P) and Recall (R) evaluation metrics. This matching criterion is lenient with respect to numerical gazetteer imprecision in situations where one toponym instance is marked up with different gazetteer entries in the gold standard and the test set, respectively, but where these refer to the same candidate referent, caused by multiple near-duplicate entries in the reference gazetteer. Main Contributions. The major contributions of this thesis are as follows: • A new reference corpus in which instances of location named entities have been manually annotated with spatial grounding information for populated places, and an associated reference gazetteer, from which the assigned candidate referents are chosen. This reference gazetteer provides numerical latitude/longitude coordinates (such as 51° 32′ North, 0° 5′ West) as well as hierarchical path descriptions (such as London > UK) with respect to a worldwide-coverage geographic taxonomy constructed by combining several large, but noisy, gazetteers. This corpus contains news stories and comprises two sub-corpora, a subset of the REUTERS RCV1 news corpus used for the CoNLL shared task (Tjong Kim Sang and De Meulder (2003)), and a subset of the Fourth Message Understanding Contest (MUC-4; Chinchor (1995)), both available pre-annotated with a gold standard. This corpus will be made available as a reference evaluation resource; • a new method and implemented system to resolve toponyms that is capable of robustly processing unseen text (open-domain online newswire text) and grounding toponym instances in an extensional model using longitude and latitude coordinates and hierarchical path descriptions, using internal (textual) and external (gazetteer) evidence; • an empirical analysis of the relative utility of various heuristic biases and other sources of evidence with respect to the toponym resolution task when analysing free news genre text; • a comparison between a replicated method as described in the literature, which functions as a baseline, and a novel algorithm based on minimality heuristics; and • several exemplary prototypical applications to show how the resulting toponym resolution methods can be used to create visual surrogates for news stories, a geographic exploration tool for news browsing, geographically-aware document retrieval and to answer spatial questions (How far...?) in an open-domain question answering system. These applications only have demonstrative character, as a thorough quantitative, task-based (extrinsic) evaluation of the utility of automatic toponym resolution is beyond the scope of this thesis and left for future work.
ER -

TY - CONF
AU - Mihalcea, Rada
AU - Csomai, Andras
T1 - Wikify!: linking documents to encyclopedic knowledge
T2 - Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management
PB - ACM
CY - New York, NY, USA
PY - 2007
SP - 233
EP - 242
UR - http://doi.acm.org/10.1145/1321440.1321475
M3 - 10.1145/1321440.1321475
KW - entity
KW - extraction
KW - named
KW - ner
KW - wikify
KW - wikipedia
SN - 978-1-59593-803-9
AB - This paper introduces the use of Wikipedia as a resource for automatic keyword extraction and word sense disambiguation, and shows how this online encyclopedia can be used to achieve state-of-the-art results on both these tasks. The paper also shows how the two methods can be combined into a system able to automatically enrich a text with links to encyclopedic knowledge. Given an input document, the system identifies the important concepts in the text and automatically links these concepts to the corresponding Wikipedia pages. Evaluations of the system show that the automatic annotations are reliable and hardly distinguishable from manual annotations.
ER -

TY - CONF
AU - Garbin, Eric
AU - Mani, Inderjeet
T1 - Disambiguating toponyms in news
T2 - Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing
PB - Association for Computational Linguistics
CY - Stroudsburg, PA, USA
PY - 2005
SP - 363
EP - 370
UR - http://dx.doi.org/10.3115/1220575.1220621
M3 - 10.3115/1220575.1220621
KW - disambiguation
KW - extraction
KW - geo
KW - map
KW - news
KW - toponym
AB - This research is aimed at the problem of disambiguating toponyms (place names) in terms of a classification derived by merging information from two publicly available gazetteers. To establish the difficulty of the problem, we measured the degree of ambiguity, with respect to a gazetteer, for toponyms in news. We found that 67.82% of the toponyms found in a corpus that were ambiguous in a gazetteer lacked a local discriminator in the text. Given the scarcity of human-annotated data, our method used unsupervised machine learning to develop disambiguation rules. Toponyms were automatically tagged with information about them found in a gazetteer. A toponym that was ambiguous in the gazetteer was automatically disambiguated based on preference heuristics. This automatically tagged data was used to train a machine learner, which disambiguated toponyms in a human-annotated news corpus at 78.5% accuracy.
ER -

TY - CONF
AU - Clough, Paul
AU - Sanderson, Mark
T1 - A proposal for comparative evaluation of automatic annotation for geo-referenced documents
T2 - Proceedings of the Workshop on Geographic Information Retrieval
PY - 2004/07
UR - http://eprints.whiterose.ac.uk/4522/
KW - annotation
KW - extraction
KW - geo
KW - map
N1 - geoEval
ER -

TY - CONF
AU - Kristjansson, Trausti T.
AU - Culotta, Aron
AU - Viola, Paul A.
AU - McCallum, Andrew
A2 - McGuinness, Deborah L.
A2 - Ferguson, George
T1 - Interactive Information Extraction with Constrained Conditional Random Fields
T2 - AAAI
PB - AAAI Press/The MIT Press
PY - 2004
SP - 412
EP - 418
UR - http://dblp.uni-trier.de/db/conf/aaai/aaai2004.html#KristjanssonCVM04
KW - crf
KW - extraction
KW - ie
KW - information
SN - 0-262-51183-5
AB - Information Extraction methods can be used to automatically "fill in" database forms from unstructured data such as Web documents or email. State-of-the-art methods have achieved low error rates but invariably make a number of errors. The goal of an interactive information extraction system is to assist the user in filling in database fields while giving the user confidence in the integrity of the data. The user is presented with an interactive interface that allows both the rapid verification of automatic field assignments and the correction of errors. In cases where there are multiple errors, our system takes into account user corrections, and immediately propagates these constraints such that other fields are often corrected automatically. Linear-chain conditional random fields (CRFs) have been shown to perform well for information extraction and other language modelling tasks due to their ability to capture arbitrary, overlapping features of the input in a Markov model. We apply this framework with two extensions: a constrained Viterbi decoding which finds the optimal field assignments consistent with the fields explicitly specified or corrected by the user; and a mechanism for estimating the confidence of each extracted field, so that low-confidence extractions can be highlighted. Both of these mechanisms are incorporated in a novel user interface for form filling that is intuitive and speeds the entry of data, providing a 23% reduction in error due to automated corrections.
ER -

TY - CONF
AU - Lafferty, John D.
AU - McCallum, Andrew
AU - Pereira, Fernando C. N.
T1 - Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
T2 - Proceedings of the Eighteenth International Conference on Machine Learning
PB - Morgan Kaufmann Publishers Inc.
CY - San Francisco, CA, USA
PY - 2001
SP - 282
EP - 289
UR - http://dl.acm.org/citation.cfm?id=645530.655813
KW - classification
KW - crf
KW - extraction
KW - ie
KW - information
KW - probability
SN - 1-55860-778-1
ER -

TY - JOUR
AU - Tejada, Sheila
AU - Knoblock, Craig A.
AU - Minton, Steven
T1 - Learning object identification rules for information integration
JO - Information Systems
PY - 2001/12
VL - 26
IS - 8
SP - 607
EP - 633
UR - http://www.sciencedirect.com/science/article/pii/S0306437901000424
M3 - 10.1016/S0306-4379(01)00042-4
KW - detection
KW - duplicate
KW - entity
KW - extraction
KW - identification
KW - information
KW - integration
AB - When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects’ shared attributes in order to identify matching objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous methods of object identification have required manual construction of object identification rules or mapping rules for determining the mappings between objects. This manual process is time-consuming and error-prone. In our approach, Active Atlas learns to tailor mapping rules, through limited user input, to a specific application domain. The experimental results demonstrate that we achieve higher accuracy and require less user involvement than previous methods across various application domains.
ER -

TY - CONF
AU - Tezuka, T.
AU - Lee, Ryong
AU - Kambayashi, Y.
AU - Takakura, H.
T1 - Web-based inference rules for processing conceptual geographical relationships
T2 - Proceedings of the Second International Conference on Web Information Systems Engineering
PY - 2001/12
VL - 2
SP - 14
EP - 21
UR - http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=996692&tag=1
M3 - 10.1109/WISE.2001.996692
KW - extraction
KW - geo
KW - map
KW - relation
KW - web
AB - Dealing with prepositions such as "near", "between" and "in front of" is very important in geographic information systems (GISs). In most systems, real-world distances are used to handle these prepositions. One of the difficulties in processing these prepositions lies in the fact that their geographical range is distorted in people's cognitive maps. For example, the size of an area referred to by the preposition "near" gets narrowed when a more famous landmark exists right next to the base geographical object. This is because users are likely to choose the most famous landmark when referring to a certain position. Also, the area referred to by "between" is not a straight line; it curves along the most commonly used pathway between the base objects. The difference in the popularity of geographical objects is the main reason for such distortions in cognitive maps. Since there is a large amount of data on the World Wide Web, we believe that such conceptual distortion can be calculated by analyzing Web data. Popularity and co-occurrence rates are calculated from their frequency in Web resources. Inference rules are set to restrict the target of conceptual prepositions using GISs and information obtained from the Web.
ER -