TY - JOUR AU - Berendt, Bettina AU - Hotho, Andreas AU - Stumme, Gerd T1 - Bridging the Gap--Data Mining and Social Network Analysis for Integrating Semantic Web and Web 2.0 JO - Web Semantics: Science, Services and Agents on the World Wide Web PY - 2010/ VL - 8 IS - 2-3 SP - 95 EP - 96 UR - http://www.sciencedirect.com/science/article/B758F-4YXK4HW-1/2/4cb514565477c54160b5e6eb716c32d7 M3 - DOI: 10.1016/j.websem.2010.04.008 KW - bridge KW - semantic KW - semantic_web KW - social KW - social_web KW - web KW - web2.0 KW - ol_web2.0 KW - background L1 - SN - N1 - N1 - AB - ER - TY - JOUR AU - Carpineto, Claudio AU - Osiński, Stanislaw AU - Romano, Giovanni AU - Weiss, Dawid T1 - A survey of Web clustering engines JO - ACM Comput. Surv. PY - 2009/07 VL - 41 IS - SP - 17:1 EP - 17:38 UR - http://doi.acm.org/10.1145/1541880.1541884 M3 - 10.1145/1541880.1541884 KW - clustering KW - web KW - bachelor:2011:bachmann L1 - SN - N1 - N1 - AB - Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. In this survey, we discuss the issues that must be addressed in the development of a Web clustering engine, including acquisition and preprocessing of search results, their clustering and visualization. Search results clustering, the core of the system, has specific requirements that cannot be addressed by classical clustering algorithms. We emphasize the role played by the quality of the cluster labels as opposed to optimizing only the clustering structure. We highlight the main characteristics of a number of existing Web clustering engines and also discuss how to evaluate their retrieval performance. Some directions for future research are finally presented. ER - TY - JOUR AU - Hazman, Maryam AU - El-Beltagy, Samhaa R. 
AU - Rafea, Ahmed T1 - Ontology learning from domain specific web documents JO - International Journal of Metadata, Semantics and Ontologies PY - 2009/ VL - 4 IS - SP - 24 EP - 33(10) UR - http://www.ingentaconnect.com/content/ind/ijmso/2009/00000004/F0020001/art00003 M3 - doi:10.1504/IJMSO.2009.026251 KW - ontology_learning KW - toread KW - web L1 - SN - N1 - IngentaConnect Ontology learning from domain specific web documents N1 - AB - Ontologies play a vital role in many web- and internet-related applications. This work presents a system for accelerating the ontology building process via semi-automatically learning a hierarchal ontology given a set of domain-specific web documents and a set of seed concepts. The methods are tested with web documents in the domain of agriculture. The ontology is constructed through the use of two complementary approaches. The presented system has been used to build an ontology in the agricultural domain using a set of Arabic extension documents and evaluated against a modified version of the AGROVOC ontology. ER - TY - CONF AU - Lu, Caimei AU - Chen, Xin AU - Park, E. K. A2 - T1 - Exploit the tripartite network of social tagging for web clustering T2 - Proceeding of the 18th ACM conference on Information and knowledge management PB - ACM CY - New York, NY, USA PY - 2009/ M2 - VL - IS - SP - 1545 EP - 1548 UR - http://doi.acm.org/10.1145/1645953.1646167 M3 - 10.1145/1645953.1646167 KW - clustering KW - web KW - bachelor:2011:bachmann L1 - SN - 978-1-60558-512-3 N1 - N1 - AB - In this poster, we investigate how to enhance web clustering by leveraging the tripartite network of social tagging systems. We propose a clustering method, called "Tripartite Clustering", which cluster the three types of nodes (resources, users and tags) simultaneously based on the links in the social tagging network. The proposed method is experimented on a real-world social tagging dataset sampled from del.icio.us. 
We also compare the proposed clustering approach with K-means. All the clustering results are evaluated against a human-maintained web directory. The experimental results show that Tripartite Clustering significantly outperforms the content-based K-means approach and achieves performance close to that of social annotation-based K-means whereas generating much more useful information. ER - TY - JOUR AU - Qi, Xiaoguang AU - Davison, Brian D. T1 - Web page classification: Features and algorithms JO - ACM Comput. Surv. PY - 2009/02 VL - 41 IS - SP - 12:1 EP - 12:31 UR - http://doi.acm.org/10.1145/1459352.1459357 M3 - 10.1145/1459352.1459357 KW - bachelor:2011:bachmann KW - classification KW - page KW - web L1 - SN - N1 - Web page classification N1 - AB - Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process.

As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages. ER - TY - BOOK AU - A2 - Alani, Harith A2 - Staab, Steffen A2 - Stumme, Gerd T1 - Proceedings of the Dagstuhl Seminar on Social Web Communities PB - Schloss Dagstuhl AD - PY - 2008/10 VL - IS - SP - EP - UR - http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=08391 M3 - KW - 2008 KW - communities KW - dagstuhl KW - social KW - web KW - tagorapub KW - tagora KW - itegpub L1 - SN - N1 - N1 - AB - ER - TY - CONF AU - Krause, Beate AU - Schmitz, Christoph AU - Hotho, Andreas AU - Stumme, Gerd A2 - T1 - The Anti-Social Tagger - Detecting Spam in Social Bookmarking Systems T2 - Proc. of the Fourth International Workshop on Adversarial Information Retrieval on the Web PB - CY - PY - 2008/ M2 - VL - IS - SP - EP - UR - http://airweb.cse.lehigh.edu/2008/submissions/krause_2008_anti_social_tagger.pdf M3 - KW - 2008 KW - systems KW - bookmarking KW - web KW - tagger KW - 2.0 KW - itegpub KW - social KW - web2.0 KW - folksonomy KW - folksonomies KW - tagorapub KW - spam L1 - SN - N1 - N1 - AB - ER - TY - CONF AU - Bollegala, Danushka AU - Matsuo, Yutaka AU - Ishizuka, Mitsuru A2 - T1 - Measuring semantic similarity between words using web search engines T2 - WWW '07: Proceedings of the 16th international conference on World Wide Web PB - ACM CY - New York, NY, USA PY - 2007/ M2 - VL - IS - SP - 757 EP - 766 UR - http://www2007.org/papers/paper632.pdf M3 - http://doi.acm.org/10.1145/1242572.1242675 KW - words KW - terms KW - similarity KW - semantic KW - web KW - toread KW - search_engine L1 - SN - 978-1-59593-654-7 N1 - Measuring semantic similarity between words using web search engines N1 - AB - ER - TY - CONF AU - Rattenbury, Tye AU - Good, Nathaniel AU - Naaman, Mor A2 - T1 - Towards automatic extraction of event and place 
semantics from flickr tags T2 - SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval PB - ACM Press CY - New York, NY, USA PY - 2007/ M2 - VL - IS - SP - 103 EP - 110 UR - http://dx.doi.org/10.1145/1277741.1277762 M3 - 10.1145/1277741.1277762 KW - emerging KW - event KW - extraction KW - flickr KW - folksonomy KW - geo KW - learning KW - ontology KW - place KW - semantic KW - web KW - ol_web2.0 KW - methods_concepts L1 - SN - 978-1-59593-597-7 N1 - N1 - AB - We describe an approach for extracting semantics of tags, unstructured text-labels assigned to resources on the Web, based on each tag's usage patterns. In particular, we focus on the problem of extracting place and event semantics for tags that are assigned to photos on Flickr, a popular photo sharing website that supports time and location (latitude/longitude) metadata. We analyze two methods inspired by well-known burst-analysis techniques and one novel method: Scale-structure Identification. We evaluate the methods on a subset of Flickr data, and show that our Scale-structure Identification method outperforms the existing techniques. The approach and methods described in this work can be used in other domains such as geo-annotated web pages, where text terms can be extracted and associated with usage patterns. 
ER - TY - CONF AU - Angelova, Ralitsa AU - Weikum, Gerhard A2 - T1 - Graph-based text classification: learn from your neighbors T2 - Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval PB - ACM CY - New York, NY, USA PY - 2006/ M2 - VL - IS - SP - 485 EP - 492 UR - http://doi.acm.org/10.1145/1148170.1148254 M3 - 10.1145/1148170.1148254 KW - bachelor:2011:bachmann KW - classification KW - web L1 - SN - 1-59593-369-7 N1 - Graph-based text classification N1 - AB - Automatic classification of data items, based on training samples, can be boosted by considering the neighborhood of data items in a graph structure (e.g., neighboring documents in a hyperlink environment or co-authors and their publications for bibliographic data entries). This paper presents a new method for graph-based classification, with particular emphasis on hyperlinked text documents but broader applicability. Our approach is based on iterative relaxation labeling and can be combined with either Bayesian or SVM classifiers on the feature spaces of the given data items. The graph neighborhood is taken into consideration to exploit locality patterns while at the same time avoiding overfitting. In contrast to prior work along these lines, our approach employs a number of novel techniques: dynamically inferring the link/class pattern in the graph in the run of the iterative relaxation labeling, judicious pruning of edges from the neighborhood graph based on node dissimilarities and node degrees, weighting the influence of edges based on a distance metric between the classification labels of interest and weighting edges by content similarity measures. Our techniques considerably improve the robustness and accuracy of the classification outcome, as shown in systematic experimental comparisons with previously published methods on three different real-world datasets. ER - TY - CONF AU - Liu, Vinci AU - Curran, James R. 
A2 - T1 - Web Text Corpus for Natural Language Processing. T2 - EACL PB - The Association for Computer Linguistics CY - PY - 2006/ M2 - VL - IS - SP - EP - UR - http://dblp.uni-trier.de/db/conf/eacl/eacl2006.html#LiuC06 M3 - KW - corpus KW - dataset KW - web KW - synonym_detection KW - nlp L1 - SN - 1-932432-59-0 N1 - dblp N1 - AB - ER - TY - CHAP AU - Choi, B. AU - Yao, Z. A2 - Chu, Wesley A2 - Young Lin, Tsau T1 - Web Page Classification T2 - Foundations and Advances in Data Mining PB - Springer CY - Berlin / Heidelberg PY - 2005/ VL - 180 IS - SP - 221 EP - 274 UR - http://dx.doi.org/10.1007/11362197_9 M3 - 10.1007/11362197_9 KW - bachelor:2011:bachmann KW - classification KW - page KW - web L1 - SN - 978-3-540-25057-9 N1 - SpringerLink - Abstract N1 - AB - This chapter describes systems that automatically classify web pages into meaningful categories. It first defines two types of web page classification: subject based and genre based classifications. It then describes the state of the art techniques and subsystems used to build automatic web page classification systems, including web page representations, dimensionality reductions, web page classifiers, and evaluation of web page classifiers. Such systems are essential tools for Web Mining and for the future of Semantic Web. 
ER - TY - CONF AU - LIU, Tie-Yan AU - YANG, Yiming AU - WAN, Hao AU - ZHOU, Qian AU - GAO, Bin AU - ZENG, Hua-Jun AU - CHEN, Zheng AU - MA, Wei-Ying A2 - T1 - An experimental study on large-scale web categorization T2 - Special interest tracks and posters of the 14th international conference on World Wide Web PB - ACM CY - New York, NY, USA PY - 2005/ M2 - VL - IS - SP - 1106 EP - 1107 UR - http://doi.acm.org/10.1145/1062745.1062891 M3 - 10.1145/1062745.1062891 KW - categorization KW - web KW - bachelor:2011:bachmann L1 - SN - 1-59593-051-5 N1 - An experimental study on large-scale web categorization N1 - AB - Taxonomies of the Web typically have hundreds of thousands of categories and skewed category distribution over documents. It is not clear whether existing text classification technologies can perform well on and scale up to such large-scale applications. To understand this, we conducted the evaluation of several representative methods (Support Vector Machines, k-Nearest Neighbor and Naive Bayes) with Yahoo! taxonomies. In particular, we evaluated the effectiveness/efficiency tradeoff in classifiers with hierarchical setting compared to conventional (flat) setting, and tested popular threshold tuning strategies for their scalability and accuracy in large-scale classification problems. 
ER - TY - CONF AU - Shen, Dou AU - Chen, Zheng AU - Yang, Qiang AU - Zeng, Hua-Jun AU - Zhang, Benyu AU - Lu, Yuchang AU - Ma, Wei-Ying A2 - T1 - Web-page classification through summarization T2 - Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval PB - ACM CY - New York, NY, USA PY - 2004/ M2 - VL - IS - SP - 242 EP - 249 UR - http://doi.acm.org/10.1145/1008992.1009035 M3 - 10.1145/1008992.1009035 KW - classification KW - web KW - bachelor:2011:bachmann L1 - SN - 1-58113-881-4 N1 - Web-page classification through summarization N1 - AB - Web-page classification is much more difficult than pure-text classification due to a large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving the accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement as compared to pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about 12.9% improvement over pure-text based methods. ER - TY - CONF AU - Omelayenko, Borys A2 - T1 - Learning of Ontologies for the Web: the Analysis of Existent Approaches T2 - Proceedings of the International Workshop on Web Dynamics, held in conj. 
with the 8th International Conference on Database Theory (ICDT’01), London, UK PB - CY - PY - 2001/ M2 - VL - IS - SP - EP - UR - http://www.dcs.bbk.ac.uk/webDyn/webDynPapers/omelayenko.pdf M3 - KW - ol_web2.0 KW - ontology_learning KW - overview KW - web L1 - SN - N1 - N1 - AB - The next generation of the Web, called Semantic Web, has to improve the Web with semantic (ontological) page annotations to enable knowledge-level querying and searches. Manual construction of these ontologies will require tremendous efforts that force future integration of machine learning with knowledge acquisition to enable highly automated ontology learning. In the paper we present the state of the art in the field of ontology learning from the Web to see how it can contribute to the task of semantic Web querying. We consider three components of the query processing system: natural language ontologies, domain ontologies and ontology instances. We discuss the requirements for machine learning algorithms to be applied for the learning of the ontologies of each type from the Web documents, and survey the existent ontology learning and other closely related approaches. 
ER - TY - CONF AU - Dumais, Susan AU - Chen, Hao A2 - T1 - Hierarchical classification of Web content T2 - Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval PB - ACM CY - New York, NY, USA PY - 2000/ M2 - VL - IS - SP - 256 EP - 263 UR - http://doi.acm.org/10.1145/345508.345593 M3 - 10.1145/345508.345593 KW - classification KW - web KW - bachelor:2011:bachmann L1 - SN - 1-58113-226-3 N1 - N1 - AB - ER - TY - JOUR AU - Chakrabarti, Soumen AU - Dom, Byron AU - Agrawal, Rakesh AU - Raghavan, Prabhakar T1 - Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies JO - The VLDB Journal PY - 1998/08 VL - 7 IS - SP - 163 EP - 178 UR - http://dx.doi.org/10.1007/s007780050061 M3 - 10.1007/s007780050061 KW - bachelor:2011:bachmann KW - organization KW - web L1 - SN - N1 - Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies N1 - AB - We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of on-line textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. 
To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from the noise words at each node of the taxonomy. Using these, we build a multilevel classifier. At each node, this classifier can ignore the large number of “noise” words in a document. Thus, the classifier has a small model size and is very fast. Owing to the use of context-sensitive features, the classifier is very accurate. As a by-product, we can compute for each document a set of terms that occur significantly more often in it than in the classes to which it belongs. We describe the design and implementation of our system, stressing how to exploit standard, efficient relational operations like sorts and joins. We report on experiences with the Reuters newswire benchmark, the US patent database, and web document samples from Yahoo!. We discuss applications where our system can improve searching and filtering capabilities. ER -