TY  - JOUR
AU  - Berendt, Bettina
AU  - Hotho, Andreas
AU  - Stumme, Gerd
T1  - Bridging the Gap--Data Mining and Social Network Analysis for Integrating Semantic Web and Web 2.0
JO  - Web Semantics: Science, Services and Agents on the World Wide Web
PY  - 2010/
VL  - 8
IS  - 2-3
SP  - 95
EP  - 96
UR  - http://www.sciencedirect.com/science/article/B758F-4YXK4HW-1/2/4cb514565477c54160b5e6eb716c32d7
M3  - 10.1016/j.websem.2010.04.008
KW  - bridge
KW  - semantic
KW  - semantic_web
KW  - social
KW  - social_web
KW  - web
KW  - web2.0
KW  - ol_web2.0
KW  - background
L1  - 
SN  - 
N1  - 
N1  - 
AB  - 
ER  -

TY  - JOUR
AU  - Carpineto, Claudio
AU  - Osiński, Stanislaw
AU  - Romano, Giovanni
AU  - Weiss, Dawid
T1  - A survey of Web clustering engines
JO  - ACM Comput. Surv.
PY  - 2009/07
VL  - 41
IS  - 
SP  - 17:1
EP  - 17:38
UR  - http://doi.acm.org/10.1145/1541880.1541884
M3  - 10.1145/1541880.1541884
KW  - clustering
KW  - web
KW  - bachelor:2011:bachmann
L1  - 
SN  - 
N1  - 
N1  - 
AB  - Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. In this survey, we discuss the issues that must be addressed in the development of a Web clustering engine, including acquisition and preprocessing of search results, their clustering and visualization. Search results clustering, the core of the system, has specific requirements that cannot be addressed by classical clustering algorithms. We emphasize the role played by the quality of the cluster labels as opposed to optimizing only the clustering structure. We highlight the main characteristics of a number of existing Web clustering engines and also discuss how to evaluate their retrieval performance. Some directions for future research are finally presented.
ER  -

TY  - JOUR
AU  - Hazman, Maryam
AU  - El-Beltagy, Samhaa R.
AU  - Rafea, Ahmed
T1  - Ontology learning from domain specific web documents
JO  - International Journal of Metadata, Semantics and Ontologies
PY  - 2009/
VL  - 4
IS  - 
SP  - 24
EP  - 33
UR  - http://www.ingentaconnect.com/content/ind/ijmso/2009/00000004/F0020001/art00003
M3  - 10.1504/IJMSO.2009.026251
KW  - ontology_learning
KW  - toread
KW  - web
L1  - 
SN  - 
N1  - IngentaConnect Ontology learning from domain specific web documents
N1  - 
AB  - Ontologies play a vital role in many web- and internet-related applications. This work presents a system for accelerating the ontology building process via semi-automatically learning a hierarchical ontology given a set of domain-specific web documents and a set of seed concepts. The methods are tested with web documents in the domain of agriculture. The ontology is constructed through the use of two complementary approaches. The presented system has been used to build an ontology in the agricultural domain using a set of Arabic extension documents and evaluated against a modified version of the AGROVOC ontology.
ER  -

TY  - CONF
AU  - Lu, Caimei
AU  - Chen, Xin
AU  - Park, E. K.
A2  - 
T1  - Exploit the tripartite network of social tagging for web clustering
T2  - Proceedings of the 18th ACM conference on Information and knowledge management
PB  - ACM
CY  - New York, NY, USA
PY  - 2009/
M2  - 
VL  - 
IS  - 
SP  - 1545
EP  - 1548
UR  - http://doi.acm.org/10.1145/1645953.1646167
M3  - 10.1145/1645953.1646167
KW  - clustering
KW  - web
KW  - bachelor:2011:bachmann
L1  - 
SN  - 978-1-60558-512-3
N1  - 
N1  - 
AB  - In this poster, we investigate how to enhance web clustering by leveraging the tripartite network of social tagging systems. We propose a clustering method, called "Tripartite Clustering", which clusters the three types of nodes (resources, users and tags) simultaneously based on the links in the social tagging network. The proposed method is experimented on a real-world social tagging dataset sampled from del.icio.us. We also compare the proposed clustering approach with K-means. All the clustering results are evaluated against a human-maintained web directory. The experimental results show that Tripartite Clustering significantly outperforms the content-based K-means approach and achieves performance close to that of social annotation-based K-means while generating much more useful information.
ER  -

TY  - JOUR
AU  - Qi, Xiaoguang
AU  - Davison, Brian D.
T1  - Web page classification: Features and algorithms
JO  - ACM Comput. Surv.
PY  - 2009/02
VL  - 41
IS  - 
SP  - 12:1
EP  - 12:31
UR  - http://doi.acm.org/10.1145/1459352.1459357
M3  - 10.1145/1459352.1459357
KW  - bachelor:2011:bachmann
KW  - classification
KW  - page
KW  - web
L1  - 
SN  - 
N1  - Web page classification
N1  - 
AB  - Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
ER  -

TY  - BOOK
AU  - 
A2  - Alani, Harith
A2  - Staab, Steffen
A2  - Stumme, Gerd
T1  - Proceedings of the Dagstuhl Seminar on Social Web Communities
PB  - Schloss Dagstuhl
AD  - 
PY  - 2008/10
VL  - 
IS  - 
SP  - 
EP  - 
UR  - http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=08391
M3  - 
KW  - 2008
KW  - communities
KW  - dagstuhl
KW  - social
KW  - web
KW  - tagorapub
KW  - tagora
KW  - itegpub
L1  - 
SN  - 
N1  - 
N1  - 
AB  - 
ER  -

TY  - CONF
AU  - Krause, Beate
AU  - Schmitz, Christoph
AU  - Hotho, Andreas
AU  - Stumme, Gerd
A2  - 
T1  - The Anti-Social Tagger - Detecting Spam in Social Bookmarking Systems
T2  - Proc. of the Fourth International Workshop on Adversarial Information Retrieval on the Web
PB  - 
CY  - 
PY  - 2008/
M2  - 
VL  - 
IS  - 
SP  - 
EP  - 
UR  - http://airweb.cse.lehigh.edu/2008/submissions/krause_2008_anti_social_tagger.pdf
M3  - 
KW  - 2008
KW  - systems
KW  - bookmarking
KW  - web
KW  - tagger
KW  - 2.0
KW  - itegpub
KW  - social
KW  - web2.0
KW  - folksonomy
KW  - folksonomies
KW  - tagorapub
KW  - spam
L1  - 
SN  - 
N1  - 
N1  - 
AB  - 
ER  -

TY  - CONF
AU  - Bollegala, Danushka
AU  - Matsuo, Yutaka
AU  - Ishizuka, Mitsuru
A2  - 
T1  - Measuring semantic similarity between words using web search engines
T2  - WWW '07: Proceedings of the 16th international conference on World Wide Web
PB  - ACM
CY  - New York, NY, USA
PY  - 2007/
M2  - 
VL  - 
IS  - 
SP  - 757
EP  - 766
UR  - http://www2007.org/papers/paper632.pdf
M3  - 10.1145/1242572.1242675
KW  - words
KW  - terms
KW  - similarity
KW  - semantic
KW  - web
KW  - toread
KW  - search_engine
L1  - 
SN  - 978-1-59593-654-7
N1  - Measuring semantic similarity between words using web search engines
N1  - 
AB  - 
ER  -

TY  - CONF
AU  - Rattenbury, Tye
AU  - Good, Nathaniel
AU  - Naaman, Mor
A2  - 
T1  - Towards automatic extraction of event and place semantics from flickr tags
T2  - SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
PB  - ACM Press
CY  - New York, NY, USA
PY  - 2007/
M2  - 
VL  - 
IS  - 
SP  - 103
EP  - 110
UR  - http://dx.doi.org/10.1145/1277741.1277762
M3  - 10.1145/1277741.1277762
KW  - emerging
KW  - event
KW  - extraction
KW  - flickr
KW  - folksonomy
KW  - geo
KW  - learning
KW  - ontology
KW  - place
KW  - semantic
KW  - web
KW  - ol_web2.0
KW  - methods_concepts
L1  - 
SN  - 978-1-59593-597-7
N1  - 
N1  - 
AB  - We describe an approach for extracting semantics of tags, unstructured text-labels assigned to resources on the Web, based on each tag's usage patterns. In particular, we focus on the problem of extracting place and event semantics for tags that are assigned to photos on Flickr, a popular photo sharing website that supports time and location (latitude/longitude) metadata. We analyze two methods inspired by well-known burst-analysis techniques and one novel method: Scale-structure Identification. We evaluate the methods on a subset of Flickr data, and show that our Scale-structure Identification method outperforms the existing techniques. The approach and methods described in this work can be used in other domains such as geo-annotated web pages, where text terms can be extracted and associated with usage patterns.
ER  -

TY  - CONF
AU  - Angelova, Ralitsa
AU  - Weikum, Gerhard
A2  - 
T1  - Graph-based text classification: learn from your neighbors
T2  - Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
PB  - ACM
CY  - New York, NY, USA
PY  - 2006/
M2  - 
VL  - 
IS  - 
SP  - 485
EP  - 492
UR  - http://doi.acm.org/10.1145/1148170.1148254
M3  - 10.1145/1148170.1148254
KW  - bachelor:2011:bachmann
KW  - classification
KW  - web
L1  - 
SN  - 1-59593-369-7
N1  - Graph-based text classification
N1  - 
AB  - Automatic classification of data items, based on training samples, can be boosted by considering the neighborhood of data items in a graph structure (e.g., neighboring documents in a hyperlink environment or co-authors and their publications for bibliographic data entries). This paper presents a new method for graph-based classification, with particular emphasis on hyperlinked text documents but broader applicability. Our approach is based on iterative relaxation labeling and can be combined with either Bayesian or SVM classifiers on the feature spaces of the given data items. The graph neighborhood is taken into consideration to exploit locality patterns while at the same time avoiding overfitting. In contrast to prior work along these lines, our approach employs a number of novel techniques: dynamically inferring the link/class pattern in the graph in the run of the iterative relaxation labeling, judicious pruning of edges from the neighborhood graph based on node dissimilarities and node degrees, weighting the influence of edges based on a distance metric between the classification labels of interest and weighting edges by content similarity measures. Our techniques considerably improve the robustness and accuracy of the classification outcome, as shown in systematic experimental comparisons with previously published methods on three different real-world datasets.
ER  -

TY  - CONF
AU  - Liu, Vinci
AU  - Curran, James R.
A2  - 
T1  - Web Text Corpus for Natural Language Processing
T2  - EACL
PB  - Association for Computational Linguistics
CY  - 
PY  - 2006/
M2  - 
VL  - 
IS  - 
SP  - 
EP  - 
UR  - http://dblp.uni-trier.de/db/conf/eacl/eacl2006.html#LiuC06
M3  - 
KW  - corpus
KW  - dataset
KW  - web
KW  - synonym_detection
KW  - nlp
L1  - 
SN  - 1-932432-59-0
N1  - dblp
N1  - 
AB  - 
ER  -

TY  - CHAP
AU  - Choi, B.
AU  - Yao, Z.
A2  - Chu, Wesley
A2  - Young Lin, Tsau
T1  - Web Page Classification
T2  - Foundations and Advances in Data Mining
PB  - Springer
CY  - Berlin / Heidelberg
PY  - 2005/
VL  - 180
IS  - 
SP  - 221
EP  - 274
UR  - http://dx.doi.org/10.1007/11362197_9
M3  - 10.1007/11362197_9
KW  - bachelor:2011:bachmann
KW  - classification
KW  - page
KW  - web
L1  - 
SN  - 978-3-540-25057-9
N1  - 
N1  - 
AB  - This chapter describes systems that automatically classify web pages into meaningful categories. It first defines two types of web page classification: subject based and genre based classifications. It then describes the state of the art techniques and subsystems used to build automatic web page classification systems, including web page representations, dimensionality reductions, web page classifiers, and evaluation of web page classifiers. Such systems are essential tools for Web Mining and for the future of Semantic Web.
ER  -

TY  - CONF
AU  - Liu, Tie-Yan
AU  - Yang, Yiming
AU  - Wan, Hao
AU  - Zhou, Qian
AU  - Gao, Bin
AU  - Zeng, Hua-Jun
AU  - Chen, Zheng
AU  - Ma, Wei-Ying
A2  - 
T1  - An experimental study on large-scale web categorization
T2  - Special interest tracks and posters of the 14th international conference on World Wide Web
PB  - ACM
CY  - New York, NY, USA
PY  - 2005/
M2  - 
VL  - 
IS  - 
SP  - 1106
EP  - 1107
UR  - http://doi.acm.org/10.1145/1062745.1062891
M3  - 10.1145/1062745.1062891
KW  - categorization
KW  - web
KW  - bachelor:2011:bachmann
L1  - 
SN  - 1-59593-051-5
N1  - An experimental study on large-scale web categorization
N1  - 
AB  - Taxonomies of the Web typically have hundreds of thousands of categories and skewed category distribution over documents. It is not clear whether existing text classification technologies can perform well on and scale up to such large-scale applications. To understand this, we conducted the evaluation of several representative methods (Support Vector Machines, k-Nearest Neighbor and Naive Bayes) with Yahoo! taxonomies. In particular, we evaluated the effectiveness/efficiency tradeoff in classifiers with hierarchical setting compared to conventional (flat) setting, and tested popular threshold tuning strategies for their scalability and accuracy in large-scale classification problems.
ER  -

TY  - CONF
AU  - Shen, Dou
AU  - Chen, Zheng
AU  - Yang, Qiang
AU  - Zeng, Hua-Jun
AU  - Zhang, Benyu
AU  - Lu, Yuchang
AU  - Ma, Wei-Ying
A2  - 
T1  - Web-page classification through summarization
T2  - Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
PB  - ACM
CY  - New York, NY, USA
PY  - 2004/
M2  - 
VL  - 
IS  - 
SP  - 242
EP  - 249
UR  - http://doi.acm.org/10.1145/1008992.1009035
M3  - 10.1145/1008992.1009035
KW  - classification
KW  - web
KW  - bachelor:2011:bachmann
L1  - 
SN  - 1-58113-881-4
N1  - Web-page classification through summarization
N1  - 
AB  - Web-page classification is much more difficult than pure-text classification due to a large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving the accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement as compared to pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about 12.9% improvement over pure-text based methods.
ER  -

TY  - CONF
AU  - Omelayenko, Borys
A2  - 
T1  - Learning of Ontologies for the Web: the Analysis of Existent Approaches
T2  - Proceedings of the International Workshop on Web Dynamics, held in conj. with the 8th International Conference on Database Theory (ICDT’01), London, UK
PB  - 
CY  - 
PY  - 2001/
M2  - 
VL  - 
IS  - 
SP  - 
EP  - 
UR  - http://www.dcs.bbk.ac.uk/webDyn/webDynPapers/omelayenko.pdf
M3  - 
KW  - ol_web2.0
KW  - ontology_learning
KW  - overview
KW  - web
L1  - 
SN  - 
N1  - 
N1  - 
AB  - The next generation of the Web, called Semantic Web, has to improve the Web with semantic (ontological) page annotations to enable knowledge-level querying and searches. Manual construction of these ontologies will require tremendous efforts that force future integration of machine learning with knowledge acquisition to enable highly automated ontology learning. In the paper we present the state-of-the-art in the field of ontology learning from the Web to see how it can contribute to the task of semantic Web querying. We consider three components of the query processing system: natural language ontologies, domain ontologies and ontology instances. We discuss the requirements for machine learning algorithms to be applied for the learning of the ontologies of each type from the Web documents, and survey the existent ontology learning and other closely related approaches.
ER  -

TY  - CONF
AU  - Dumais, Susan
AU  - Chen, Hao
A2  - 
T1  - Hierarchical classification of Web content
T2  - Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
PB  - ACM
CY  - New York, NY, USA
PY  - 2000/
M2  - 
VL  - 
IS  - 
SP  - 256
EP  - 263
UR  - http://doi.acm.org/10.1145/345508.345593
M3  - 10.1145/345508.345593
KW  - classification
KW  - web
KW  - bachelor:2011:bachmann
L1  - 
SN  - 1-58113-226-3
N1  - 
N1  - 
AB  - 
ER  -

TY  - JOUR
AU  - Chakrabarti, Soumen
AU  - Dom, Byron
AU  - Agrawal, Rakesh
AU  - Raghavan, Prabhakar
T1  - Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies
JO  - The VLDB Journal
PY  - 1998/08
VL  - 7
IS  - 
SP  - 163
EP  - 178
UR  - http://dx.doi.org/10.1007/s007780050061
M3  - 10.1007/s007780050061
KW  - bachelor:2011:bachmann
KW  - organization
KW  - web
L1  - 
SN  - 
N1  - Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies
N1  - 
AB  - We explore how to organize large text databases hierarchically by topic to aid better searching, browsing and filtering. Many corpora, such as internet directories, digital libraries, and patent databases are manually organized into topic hierarchies, also called taxonomies. Similar to indices for relational data, taxonomies make search and access more efficient. However, the exponential growth in the volume of on-line textual information makes it nearly impossible to maintain such taxonomic organization for large, fast-changing corpora by hand. We describe an automatic system that starts with a small sample of the corpus in which topics have been assigned by hand, and then updates the database with new documents as the corpus grows, assigning topics to these new documents with high speed and accuracy. To do this, we use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from the noise words at each node of the taxonomy. Using these, we build a multilevel classifier. At each node, this classifier can ignore the large number of "noise" words in a document. Thus, the classifier has a small model size and is very fast. Owing to the use of context-sensitive features, the classifier is very accurate. As a by-product, we can compute for each document a set of terms that occur significantly more often in it than in the classes to which it belongs. We describe the design and implementation of our system, stressing how to exploit standard, efficient relational operations like sorts and joins. We report on experiences with the Reuters newswire benchmark, the US patent database, and web document samples from Yahoo!. We discuss applications where our system can improve searching and filtering capabilities.
ER  -