Publications
Generality of Texts
Allen, R. & Wu, Y.
Lim, E.; Foo, S.; Khoo, C.; Chen, H.; Fox, E.; Urs, S. & Costantino, T., ed., 'Digital Libraries: People, Knowledge, and Technology', 2555, Springer, Berlin / Heidelberg, 111-116 (2010) [pdf]
When searching or browsing, a user may be looking for either very general information or very specific information. We explored predictors for characterizing the generality of six encyclopedia texts. We had human subjects rank-order the generality of the texts. We also developed statistics from analysis of word frequency and from comparison to a set of reference terms. We found a statistically significant relationship between the human ratings of text generality and our automatic measure.
Predicting Partial Orders: Ranking with Abstention
Cheng, W.; Rademaker, M.; De Baets, B. & Hüllermeier, E.
Balcázar, J. L.; Bonchi, F.; Gionis, A. & Sebag, M., ed., 'ECML/PKDD (1)', 6321, Lecture Notes in Computer Science, Springer, 215-230 (2010) [pdf]
Exploring Wikipedia and DMoz as Knowledge Bases for Engineering a User Interests Hierarchy for Social Network Applications
Haridas, M. & Caragea, D.
Meersman, R.; Dillon, T. & Herrero, P., ed., 'On the Move to Meaningful Internet Systems: OTM 2009', 5871, Springer, Berlin / Heidelberg, 1238-1245 (2009) [pdf]
The outgrowth of social networks in the recent years has resulted in opportunities for interesting data mining problems, such as interest or friendship recommendations. A global ontology over the interests specified by the users of a social network is essential for accurate recommendations. We propose, evaluate and compare three approaches to engineering a hierarchical ontology over user interests. The proposed approaches make use of two popular knowledge bases, Wikipedia and Directory Mozilla, to extract interest definitions and/or relationships between interests. More precisely, the first approach uses Wikipedia to find interest definitions, the latent semantic analysis technique to measure the similarity between interests based on their definitions, and an agglomerative clustering algorithm to group similar interests into higher level concepts. The second approach uses the Wikipedia Category Graph to extract relationships between interests, while the third approach uses Directory Mozilla to extract relationships between interests. Our results show that the third approach, although the simplest, is the most effective for building a hierarchy over user interests.
Personalised Tag Recommendation
Landia, N. & Anand, S.
Recommender Systems & the Social Web (2009) [pdf]
A Metric-based Framework for Automatic Taxonomy Induction
Yang, H. & Callan, J.
'Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL2009)', Singapore, 271–279 (2009)
From key words to key semantic domains
Rayson, P.
International Journal of Corpus Linguistics, 13, 519-549 (December 2008) [pdf]
This paper reports the extension of the key words method for the comparison of corpora. Using automatic tagging software that assigns part-of-speech and semantic field (domain) tags, a method is described which permits the extraction of key domains by applying the keyness calculation to tag frequency lists. The combination of the key words and key domains methods is shown to allow macroscopic analysis (the study of the characteristics of whole texts or varieties of language) to inform the microscopic level (focussing on the use of a particular linguistic feature) and thereby suggesting those linguistic features which should be investigated further. The resulting 'data-driven' approach presented here combines elements of both the 'corpus-based' and 'corpus-driven' paradigms in corpus linguistics. A web-based tool, Wmatrix, implementing the proposed method is applied in a case study: the comparison of UK 2001 general election manifestos of the Labour and Liberal Democratic parties.
Document generality: its computation for ranking
Yan, X.; Li, X. & Song, D.
'ADC '06: Proceedings of the 17th Australasian Database Conference', Australian Computer Society, Inc., Darlinghurst, Australia, 109-118 (2006) [pdf]
The increased variety of information makes it critical to retrieve documents which are not only relevant but also broad enough to cover as many different aspects of a certain topic as possible. The increased variety of users also makes it critical to retrieve documents that are jargon-free and easy to understand rather than specific technical materials. In this paper, we propose a new concept, namely document generality computation. The generality of a document is of fundamental importance to information retrieval. Document generality is the state or quality of a document being general. We compute document generality based on a domain-ontology method that analyzes the scope and semantic cohesion of concepts appearing in the text. For test purposes, our proposed approach is then applied to improving the performance of document ranking in bio-medical information retrieval. The retrieved documents are re-ranked by a combined score of similarity and the closeness of documents' generality to that of a query. The experiments have shown that our method can work on a large-scale bio-medical text corpus, OHSUMED (Hersh, Buckley, Leone & Hickam 1994), which is a subset of the MEDLINE collection containing 348,566 medical journal references and 101 test queries, with an encouraging performance.
Extensions of the Paivio, Yuille, and Madigan (1968) norms
Clark, J. & Paivio, A.
Behavior Research Methods, Instruments, & Computers, 36(3) 371 (2004) [pdf]
Generality-Based Conceptual Clustering with Probabilistic Concepts
Talavera, L. & Béjar, J.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2) 196-206 (2001) [pdf]
Word Frequency and Word Difficulty
Breland, H.
Psychological Science, 7(2) 96-99 (1996) [pdf]
Knowledge-based automatic topic identification
Lin, C.-Y.
'Proceedings of the 33rd annual meeting on Association for Computational Linguistics', Association for Computational Linguistics, Morristown, NJ, USA, [http://dx.doi.org/10.3115/981658.981705], 308-310 (1995) [pdf]
Imagery, concreteness, emotionality, and meaningfulness values of words: replication and extension
Campos, A. & González, M.
Perceptual and Motor Skills, 74(3 Pt 1) 691 (1992) [pdf]
Object categories and expertise: Is the basic level in the eye of the beholder?
Tanaka, J. W. & Taylor, M.
Cognitive Psychology, 23(3) 457-482 (1991) [pdf]
Classic research on conceptual hierarchies has shown that the interaction between the human perceiver and objects in the environment specifies one level of abstraction for categorizing objects, called the basic level, which plays a primary role in cognition. The question of whether the special psychological status of the basic level can be modified by experience was addressed in three experiments comparing the performance of subjects in expert and novice domains. The main findings were that in the domain of expertise (a) subordinate-level categories were as differentiated as the basic-level categories, (b) subordinate-level names were used as frequently as basic-level names for identifying objects, and (c) subordinate-level categorizations were as fast as basic-level categorizations. Taken together, these results demonstrate that individual differences in domain-specific knowledge affect the extent that the basic level is central to categorization.
Two meanings of word abstractness
Kammann, R. & Streeter, L.
Journal of Verbal Learning and Verbal Behavior, 10(3) 303 - 306 (1971) [pdf]
Word abstractness has been defined in terms of hierarchical superordination or empirical ratings based on accessibility to the senses. Since a high-level superordinate (a generic term) should not be accessible to the senses, the two definitions should be correlated. Four Ss constructed word hierarchies from a pool of 925 nouns. Neither the size of a patriarch's hierarchy, nor its status as a superordinate was noticeably predictive of its abstractness rating, while its particular hierarchy membership was. The two definitions of abstractness appear to be mostly orthogonal. Subjects appear to rate the abstractness of a generic noun in terms of the abstractness of its exemplars.
Words and Things
Brown, R.
1968, Free Press [pdf]
Concreteness, imagery, and meaningfulness values for 925 nouns
Paivio, A.; Yuille, J. C. & Madigan, S. A.
Journal of Experimental Psychology, 76(1, Part 2) 1 - 25 (1968) [pdf]
Groups of Ss, 17-46 yr. old college students, were used to scale 925 nouns on abstractness-concreteness (C), imagery (I), and meaningfulness (M). Concreteness was defined in terms of directness of reference to sense experience, and I, in terms of word's capacity to arouse nonverbal images; C and I were rated on 7-point scales. Meaningfulness was defined in terms of the mean number of written associations in 30 sec. The mean scale values for these variables are presented for each of the 925 nouns. Also reported are the intercorrelations of the variables, together with an examination of the words for which C, I, and M values are most clearly differentiated; and reliability data, including comparisons with scale values for the variables from other studies. (45 ref.)
Word abstractness and meaningfulness, and paired-associate learning in children
Paivio, A. & Yuille, J. C.
Journal of Experimental Child Psychology, 4(1) 81 - 89 (1966) [pdf]
Research with adults has shown that paired-associate (PA) learning of nouns, with abstractness-concreteness of the words simultaneously varied on both sides of pairs, is facilitated by concreteness, and this effect is greater on the stimulus than on the response side. The problem was investigated further in the present study with fourth-, sixth-, and eighth-grade children. Since concreteness has been found to correlate with meaningfulness (m), data were first obtained on the m of 32 concrete and 32 abstract, high-frequency nouns. At all three grade levels, the m of concrete nouns was higher than that of abstract nouns, and the words significantly retained their m rank across grades. Four comparable versions of a 16-pair list were constructed from 32 of the nouns, each list including 4 pairs of each possible S-R combination, i.e., concrete-concrete, concrete-abstract, abstract-concrete, and abstract-abstract. Groups of Ss were auditorially presented 4 alternating study trials and recall trials with a list. Analysis of the recall scores for Ss from each of three schools showed that recall increased with grade, and that positive effects of concreteness were generally greater on the stimulus than on the response side of pairs. The differential effect favoring stimulus over response concreteness was, however, smaller than in the earlier research with adults, and somewhat inconsistent across schools.
Learning as a function of word-frequency
Hall, J.
The American Journal of Psychology, 67(1) 138-140 (1954) [pdf]