CDAS: a crowdsourcing data analytics system.
Proceedings of the VLDB Endowment, 5(10):1040-1051, 2012.
Xuan Liu, Meiyu Lu, Beng Chin Ooi, Yanyan Shen, Sai Wu and Meihui Zhang.
Some complex problems, such as image tagging and natural language processing, are very challenging for computers, and even state-of-the-art technology is not yet able to provide satisfactory accuracy. Therefore, rather than relying solely on developing new and better algorithms to handle such tasks, we look to the crowdsourcing solution -- employing human participation -- to make good the shortfall in current technology. Crowdsourcing is a good supplement to many computer tasks. A complex job may be divided into computer-oriented tasks and human-oriented tasks, which are then assigned to machines and humans respectively.
To leverage the power of crowdsourcing, we design and implement a Crowdsourcing Data Analytics System, CDAS. CDAS is a framework designed to support the deployment of various crowdsourcing applications. The core part of CDAS is a quality-sensitive answering model, which guides the crowdsourcing engine to process and monitor the human tasks. In this paper, we introduce the principles of our quality-sensitive model. To satisfy the user-required accuracy, the model guides the crowdsourcing query engine in the design and processing of the corresponding crowdsourcing jobs. It provides an estimated accuracy for each generated result based on the human workers' historical performance. When verifying the quality of the result, the model employs an online strategy to reduce waiting time. To show the effectiveness of the model, we implement and deploy two analytics jobs on CDAS: a Twitter sentiment analytics job and an image tagging job. We use real Twitter and Flickr data as our queries respectively. We compare our approaches with state-of-the-art classification and image annotation techniques. The results show that the human-assisted methods can indeed achieve a much higher accuracy. By embedding the quality-sensitive model into the crowdsourcing query engine, we effectively reduce the processing cost while maintaining the required query answer quality.
Webarchivierung und Web Archive Mining: Notwendigkeit, Probleme und Lösungsansätze.
HMD Praxis der Wirtschaftsinformatik, 268, 2009.
Andreas Rauber and Max Kaiser.
In recent years, libraries and archives have increasingly taken on the task of collecting content from the World Wide Web alongside conventional publications, in order to preserve this valuable part of our cultural heritage and keep important information available in the long term. These massive data collections offer fascinating opportunities to quickly access important information that has already been lost on the live web. They are an indispensable source for researchers who will want to trace the social and technological development of our time in the future. On the other hand, such a data collection constitutes an entirely new body of data that raises not only legal but also numerous ethical questions regarding its use. These questions will grow as the technical means for automatically analyzing and interpreting the data become more powerful. Because most web archiving initiatives are aware of this problem, use of the data currently remains heavily restricted in most cases, or some form of "opt-out" option is provided that lets website owners exclude their pages from a web archive. Under either approach, web archives cannot realize their full potential for use. This article first briefly describes the technologies used to collect web content for archiving purposes. It questions assumptions concerning the free availability of the data and the different kinds of use. Building on this, it identifies a number of open questions whose resolution could allow broader access to and better use of web archives.
Opinion Mining and Sentiment Analysis.
Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.
Bo Pang and Lillian Lee.
An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object.
Top 10 algorithms in data mining.
Knowledge and Information Systems, 14(1):1-37, 2008.
Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey McLachlan, Angus Ng, Bing Liu, Philip Yu, Zhi-Hua Zhou, Michael Steinbach, David Hand and Dan Steinberg.
This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.
FCA-based approach for mining contextualized folksonomy.
In:
SAC '07: Proceedings of the 2007 ACM symposium on Applied computing, pages 1340-1345.
ACM Press, New York, NY, USA, 2007.
Hak Lae Kim, Suk Hyung Hwang and Hong Gee Kim.
Educational data mining: A survey from 1995 to 2005.
Expert Syst. Appl., 33(1):135-146, 2007.
C. Romero and S. Ventura.
Currently there is an increasing interest in data mining and educational systems, making educational data mining a new, growing research community. This paper surveys the application of data mining to traditional educational systems, particular web-based courses, well-known learning content management systems, and adaptive and intelligent web-based educational systems. Each of these systems has different data sources and objectives for knowledge discovery. After preprocessing the available data in each case, data mining techniques can be applied: statistics and visualization; clustering, classification and outlier detection; association rule mining and pattern mining; and text mining. Despite the plentiful work to date, much more specialized work is needed for educational data mining to become a mature area.
Information Retrieval in Folksonomies: Search and Ranking.
In: Y. Sure and J. Domingue, editors,
The Semantic Web: Research and Applications, volume 4011, series Lecture Notes in Computer Science, pages 411-426.
Springer, Heidelberg, 2006.
Andreas Hotho, Robert Jäschke, Christoph Schmitz and Gerd Stumme.
Social bookmark tools are rapidly emerging on the Web. In such systems, users set up lightweight conceptual structures called folksonomies. The reason for their immediate success is that no specific skills are needed to participate. At the moment, however, the information retrieval support is limited. We present a formal model and a new search algorithm for folksonomies, called FolkRank, that exploits the structure of the folksonomy. The proposed algorithm is also applied to find communities within the folksonomy and is used to structure search results. All findings are demonstrated on a large-scale dataset.
Fast and Memory Efficient Mining of Frequent Closed Itemsets.
IEEE Transactions On Knowledge and Data Engineering, 18(1):21-36, 2006.
Claudio Lucchese, Salvatore Orlando and Raffaele Perego.
Mining Association Rules in Folksonomies.
In: V. Batagelj, H.-H. Bock, A. Ferligoj and A. Žiberna, editors,
Data Science and Classification, series Studies in Classification, Data Analysis, and Knowledge Organization, pages 261-270.
Springer, Berlin, Heidelberg, 2006.
Christoph Schmitz, Andreas Hotho, Robert Jäschke and Gerd Stumme.
Proc. of the European Web Mining Forum 2005.
2005.
Bettina Berendt, Andreas Hotho, Dunja Mladenic, Giovanni Semerano, Myra Spiliopoulou, Gerd Stumme and Maarten van Someren.
Semantic Web Mining and the Representation, Analysis, and Evolution of Web Space.
In: V. Svatek and V. Snasel, editors,
Proc. of the 1st Intl. Workshop on Representation and Analysis of Web Space, pages 1-16.
Technical University of Ostrava, 2005.
Bettina Berendt, Andreas Hotho and Gerd Stumme.
Usage Mining for and on the Semantic Web.
In: H. Kargupta, A. Joshi, K. Sivakumar and Y. Yesha, editors,
Data Mining Next Generation Challenges and Future Directions, pages 461-481.
AAAI Press, Boston, 2004.
Bettina Berendt, Andreas Hotho and Gerd Stumme.
Semantic Web Mining aims at combining the two fast-developing research areas Semantic Web and Web Mining. Web Mining aims at discovering insights about the meaning of Web resources and their usage. Given the primarily syntactical nature of data Web mining operates on, the discovery of meaning is impossible based on these data only. Therefore, formalizations of the semantics of Web resources and navigation behavior are increasingly being used. This fits exactly with the aims of the Semantic Web: the Semantic Web enriches the WWW by machine-processable information which supports the user in his tasks. In this paper, we discuss the interplay of the Semantic Web with Web Mining, with a specific focus on usage mining.
An Efficient Parallel and Distributed Algorithm for Counting Frequent Sets.
In:
High Performance Computing for Computational Science — VECPAR 2002, pages 3-29.
2003.
Salvatore Orlando, Paolo Palmerini, Raffaele Perego and Fabrizio Silvestri.
Due to the huge increase in the number and dimension of available databases, efficient solutions for counting frequent sets are nowadays very important within the Data Mining community. Several sequential and parallel algorithms were proposed, which in many cases exhibit excellent scalability. In this paper we present ParDCI, a distributed and multithreaded algorithm for counting the occurrences of frequent sets within transactional databases. ParDCI is a parallel version of DCI (Direct Count & Intersect), a multi-strategy algorithm which is able to adapt its behavior not only to the features of the specific computing platform (e.g. available memory), but also to the features of the dataset being processed (e.g. sparse or dense datasets). ParDCI enhances previous proposals by exploiting the highly optimized counting and intersection techniques of DCI, and by relying on a multi-level parallelization approach which explicitly targets clusters of SMPs, an emerging computing platform. We focused our work on the efficient exploitation of the underlying architecture. Intra-Node multithreading effectively exploits the memory hierarchies of each SMP node, while Inter-Node parallelism exploits smart partitioning techniques aimed at reducing communication overheads. In depth experimental evaluations demonstrate that ParDCI reaches nearly optimal performances under a variety of conditions.
Towards Semantic Web Mining.
In: I. Horrocks and J. Hendler, editors,
The Semantic Web - ISWC 2002, series LNCS, pages 264-278.
Springer, Heidelberg, 2002.
B. Berendt, A. Hotho and G. Stumme.
Semantic Web Mining for Building Information Portals (Position Paper).
In:
Proc. Arbeitskreistreffen Knowledge Discovery.
Oldenburg, 2002.
J. Hartmann, A. Hotho and G. Stumme.
Usage Mining for and on the Semantic Web.
In:
Proc. NSF Workshop on Next Generation Data Mining, pages 77-86.
Baltimore, 2002.
G. Stumme, B. Berendt and A. Hotho.
A novel Web usage mining approach for search engines.
Computer Networks, 39(3):303-310, 2002.
Dell Zhang and Yisheng Dong.
Web usage mining can be very useful to search engines. This paper proposes a novel effective approach to exploit the relationships among users, queries and resources based on the search engine's log. How this method can be applied is illustrated by a Chinese image search engine.
Semantic Web Mining.
Freiburg, 2001.
G. Stumme, A. Hotho and B. Berendt.
The Visual Display of Quantitative Information.
2001.
Edward R. Tufte.
From data mining to knowledge discovery: an overview.
In: U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, editors,
Advances in knowledge discovery and data mining, pages 1-34.
American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996.
Usama M. Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth.
Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.