Crowdsourcing systems on the World-Wide Web.
Communications of the ACM, 54(4):86-96, 2011.
AnHai Doan, Raghu Ramakrishnan and Alon Y. Halevy.
[doi]
[Abstract]
[BibTeX]
The practice of crowdsourcing is transforming the Web and giving rise to a new field.
Efficiently incorporating user feedback into information extraction and integration programs.
In:
Proceedings of the 35th SIGMOD international conference on Management of data, pages 87-100.
ACM, New York, NY, USA, 2009.
Xiaoyong Chai, Ba-Quy Vuong, AnHai Doan and Jeffrey F. Naughton.
[doi]
[Abstract]
[BibTeX]
Many applications increasingly employ information extraction and integration (IE/II) programs to infer structures from unstructured data. Automatic IE/II is inherently imprecise. Hence such programs often make many IE/II mistakes, and thus can significantly benefit from user feedback. Today, however, there is no good way to automatically provide and process such feedback. When finding an IE/II mistake, users often must alert the developer team (e.g., via email or Web form) about the mistake, and then wait for the team to manually examine the program internals to locate and fix the mistake, a slow, error-prone, and frustrating process.

In this paper we propose a solution for users to directly provide feedback and for IE/II programs to automatically process such feedback. In our solution a developer U uses hlog, a declarative IE/II language, to write an IE/II program P. Next, U writes declarative user feedback rules that specify which parts of P's data (e.g., input, intermediate, or output data) users can edit, and via which user interfaces. Next, the so-augmented program P is executed, then enters a loop of waiting for and incorporating user feedback. Given user feedback F on a data portion of P, we show how to automatically propagate F to the rest of P, and to seamlessly combine F with prior user feedback. We describe the syntax and semantics of hlog, a baseline execution strategy, and then various optimization techniques. Finally, we describe experiments with real-world data that demonstrate the promise of our solution.
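The feedback loop this abstract describes can be illustrated with a minimal sketch: an imprecise automatic extractor produces a value, a user correction on that data portion is recorded, and the correction overrides the extractor's output on every later run. All names here (`extract_price`, `FeedbackStore`) are illustrative stand-ins, not hlog's actual API.

```python
import re

def extract_price(text):
    """Imprecise automatic extractor (stand-in for one IE step)."""
    m = re.search(r"\d+", text)
    return int(m.group()) if m else None

class FeedbackStore:
    """Remembers user corrections keyed by the input they apply to."""
    def __init__(self):
        self.corrections = {}

    def record(self, key, corrected_value):
        """A user flags an extraction mistake and supplies the fix."""
        self.corrections[key] = corrected_value

    def apply(self, key, extracted_value):
        """Prior user feedback wins over the automatic extraction."""
        return self.corrections.get(key, extracted_value)

store = FeedbackStore()
doc = "asking 4200 dollars"
print(store.apply(doc, extract_price(doc)))  # automatic result: 4200
store.record(doc, 4250)                      # user fixes a mistake once
print(store.apply(doc, extract_price(doc)))  # corrected on re-run: 4250
```

The real system goes much further (propagating feedback through downstream rules and combining conflicting edits), but the core inversion is the same: corrections live in data, not in emails to the developer team.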
Information extraction challenges in managing unstructured data.
SIGMOD Record, 37(4):14-20, 2009.
AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, Chaitanya Gokhale, Jiansheng Huang, Warren Shen and Ba-Quy Vuong.
[doi]
[Abstract]
[BibTeX]
Over the past few years, we have been trying to build an end-to-end system at Wisconsin to manage unstructured data, using extraction, integration, and user interaction. This paper describes the key information extraction (IE) challenges that we have run into, and sketches our solutions. We discuss in particular developing a declarative IE language, optimizing for this language, generating IE provenance, incorporating user feedback into the IE process, developing a novel wiki-based user interface for feedback, best-effort IE, pushing IE into RDBMSs, and more. Our work suggests that IE in managing unstructured data can open up many interesting research challenges, and that these challenges can greatly benefit from the wealth of work on managing structured data that has been carried out by the database community.
iMAP: Discovering Complex Mappings between Database Schemas.
In: G. Weikum, A. C. König and S. Deßloch
(editors):
SIGMOD Conference, pages 383-394.
ACM, 2004.
Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Y. Halevy and Pedro Domingos.
[doi]
[BibTeX]
Learning to Map between Ontologies on the Semantic Web.
In:
Proceedings of the Eleventh International World Wide Web Conference.
Honolulu, Hawaii, USA, 2002.
AnHai Doan, Jayant Madhavan, Pedro Domingos and Alon Halevy.
[doi]
[BibTeX]
Reconciling schemas of disparate data sources: a machine-learning approach.
SIGMOD Record, 30(2):509-520, 2001.
AnHai Doan, Pedro Domingos and Alon Y. Halevy.
[doi]
[Abstract]
[BibTeX]
A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to semi-automatically find such mappings. LSD first asks the user to provide the semantic mappings for a small set of data sources, then uses these mappings together with the sources to train a set of learners. Each learner exploits a different type of information either in the source schemas or in their data. Once the learners have been trained, LSD finds semantic mappings for a new data source by applying the learners, then combining their predictions using a meta-learner. To further improve matching accuracy, we extend machine learning techniques so that LSD can incorporate domain constraints as an additional source of knowledge, and develop a novel learner that utilizes the structural information in XML documents. Our approach thus is distinguished in that it incorporates multiple types of knowledge. Importantly, its architecture is extensible to additional learners that may exploit new kinds of information. We describe a set of experiments on several real-world domains, and show that LSD proposes semantic mappings with a high degree of accuracy.
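The combination step this abstract describes (base learners scoring candidate labels, a meta-learner merging their predictions) can be sketched as a weighted vote. The learner names, features, and weights below are illustrative assumptions, not LSD's actual learners, which are trained rather than hand-weighted.

```python
# Hypothetical sketch of an LSD-style combination step: each base learner
# scores candidate mediated-schema labels for one source column, using a
# different type of information; a meta-learner merges the scores.

def name_matcher(column_name, values):
    """Base learner using only the source schema (the column name)."""
    name = column_name.lower()
    return {"price": 1.0 if "price" in name else 0.0,
            "phone": 1.0 if "phone" in name else 0.0}

def value_matcher(column_name, values):
    """Base learner using only the data values (a crude format feature)."""
    looks_like_phone = sum("-" in v for v in values) / len(values)
    return {"price": 1.0 - looks_like_phone, "phone": looks_like_phone}

def meta_learner(predictions, weights):
    """Combine base-learner scores with per-learner weights; pick the best label."""
    combined = {}
    for learner, scores in predictions.items():
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + weights[learner] * score
    return max(combined, key=combined.get)

column, values = "contact_phone", ["608-555-0101", "608-555-0102"]
preds = {"name": name_matcher(column, values),
         "value": value_matcher(column, values)}
print(meta_learner(preds, {"name": 0.6, "value": 0.4}))  # phone
```

Because each base learner exploits a different kind of evidence, the architecture extends naturally: adding a new learner means adding one more entry to `preds` and a weight for it, which mirrors the extensibility claim in the abstract.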