ConferenceProceedingsinproceedings1529607Applying latent dirichlet allocation to group discovery in large graphs2009HendersonKeithEliassi-RadTina1456-1461 ISBN: 978-1-60558-166-8 DOI: http://doi.acm.org/10.1145/1529282.1529607ACMNew York, NY, USASAC '09: Proceedings of the 2009 ACM symposium on Applied ComputingSAC '09: Proceedings of the 2009 ACM symposium on Applied Computinghttp://portal.acm.org/citation.cfm?id=1529607This paper introduces LDA-G, a scalable Bayesian approach to finding latent group structures in large real-world graph data. Existing Bayesian approaches for group discovery (such as Infinite Relational Models) have only been applied to small graphs with a couple of hundred nodes. LDA-G (short for Latent Dirichlet Allocation for Graphs) utilizes a well-known topic modeling algorithm to find latent group structure. Specifically, we modify Latent Dirichlet Allocation (LDA) to operate on graph data instead of text corpora. Our modifications reflect the differences between real-world graph data and text corpora (e.g., a node's neighbor count vs. a document's word count). In our empirical study, we apply LDA-G to several large graphs (with thousands of nodes) from PubMed (a scientific publication repository). We compare LDA-G's quantitative performance on link prediction with two existing approaches: one Bayesian (namely, Infinite Relational Model) and one non-Bayesian (namely, Cross-association). On average, LDA-G outperforms IRM by 15% and Cross-association by 25% (in terms of area under the ROC curve). Furthermore, we demonstrate that LDA-G can discover useful qualitative information.community, detection, lda