Community Detection

Community Detection

Community Detection Laks V.S. Lakshmanan (based on Girvan & Newman. Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004). M. E. J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133 (2004). The Problem Can we partition the network into groups s.t. the inter-group edges are sparse while the intra-group edges are dense? Why is it interesting/useful? Understanding comm. structure means to understanding n/w structure. Graph partitioning similar problem; graph of processes, edges=communication; assign sub-graphs to processors to minimize interprocessor comm. & balance processor load. (NP-hard in general.) Diff. w/ graph partitioning.

An Example with Three Communities A Hierarchical Clustering Approach 1. 2. 3. 4. 5. 6. Define a notion of similarity or affinity between nodes. E.g.: := #node-disjoint paths between and . := #edge-disjoint paths between and . := weighted sum of all paths, with longer paths weighted down, e.g., Katz! Qn: how can we compute #2, 3 fast? (Efficient algorithms for Katz have been developed.)

Community detection via hierarchical clustering Compute all pairwise node similarities for every edge present. Repeatedly add edges with greatest similarity. leads to a tree (called dendrogram). A slice throguh the dendogram represents a clustering or comm. structure. Dendrogram example Limitations of HC approach Misplaces periphery.

5 E.g.: 4 nodes in the 1 2 3 Which community should 5 belong to? Alternative approach based on edge betweenness. Key Intuition An inter-comm. edge has a higher betweenness compared to an intra-comm. edge, i.e., more paths between node pairs pass through it. Start with G.

Repeatedly remove edges with highest betweenness until . Communities = resulting Basic Algorithm repeat { Calculate betweenness of all edges; Remove one with highest betweenness, breaking ties arbitrarily; } Until no edges left. Remarks: Which betweenness score? Calculate upfront and reuse or recalculate? Can we incrementally recalculate after each edge removal? Related algorithms for node betweenness by Newman and Brandes.

A Real Example (Zacharys Karate Club) With recalculation of betweenness. Without recalculation of betweenness. Scalability Issues Edge betweenness for all edges can be computed in time (=#edges, =#nodes). [Newman 2001] details soon. Recalculation makes algorithm , so not feasible for large networks. Computing edge betweenness An Example b

d a g c e f Compute #geodesics from every node to g. Breadth-first search means for doing many things. Computing edge betweenness An Example

b d d=0 w=1 a c g e f Breadth-first search means for doing many things. Computing edge betweenness An Example b

d=1 w=1 d d=0 w=1 a c e d=1 w=1 g f Breadth-first search means for doing many things.

Computing edge betweenness An Example d=2 w=2 b d=1 w=1 d d=0 w=1 a c d=2 w=2

e d=1 w=1 f g d=2 w=2 Breadth-first search means for doing many things. Computing edge betweenness An Example d=2 w=2 d=3 w=4

b d=1 w=1 d d=0 w=1 a c d=2 w=2 e d=1 w=1 f

g Have all info. we need for edge betweenness now. d=2 w=2 Breadth-first search means for doing many things. Computing edge betweenness An Example d=2 w=2 d=3 2/4

w=4 b d=1 w=1 d 1/2 a 2/4 d=0 w=1 c d=2 w=2

d=1 w=1 1/2 e f g Note: a and f are like leaves: no geodesic to g from other nodes passes through them. d=2 w=2 Breadth-first search means for doing many things.

Computing edge betweenness An Example d=2 w=2 d=3 2/4 w=4 b (1+2/4) d (1 + a 2/4 + (1

) 4 2/ d=1 w=1 d=0 w=1 2/ 4) 1/2 d=1 w=1 1/2 c (1+2/4)e d=2 w=2

f g Note: a and f are like leaves: no geodesic to g from other nodes passes through them. d=2 w=2 Breadth-first search means for doing many things. Computing edge betweenness An Example d=2 w=2

d=3 2/4 w=4 + (1 ) 4 2/ d=1 w=1 b (1+2/4) d (1 + a

2/4 1/1[ 1+(1+2/4)+1/2(1+2/4)+1/2 d=0 w=1 2/ 4) 1/2 d=1 w=1 1/2 c (1+2/4)e d=2 w=2 f g

Note: a and f are like leaves: no geodesic to g from other nodes passes through them. d=2 w=2 Breadth-first search means for doing many things. EB Computation summary For any one target node, compute weights of nodes by BFS; = #geodesics from to target. Suppose rest of (containing target). Then intuitively, of the geodesics

from to the target node go through . EB Computation summary (contd.) For any edge ( further from target than ), = The above is wrt a specific target node. Overall bet for any edge = sum of bet wrt every node treated as target node. EB computation complexity analysis For any one target node, BFS gives bet of every edge w.r.t. that target node, in time. Doing so for every node treated

as target node time for final betweenness score for every edge. Quite elegant, but recalculation bumps up complexity to Need more scalable approaches On scaling up CD algorithm determine intelligently which edges need their bet recalculated, when an edge is removed. When is removed, needs to be recalculated only if is in the same connected component as . For a very large component, doesnt prune much. Perhaps its only important to determine the edge with the next highest bet. can

we maintain enough state so that when is removed, we can recalculate incrementally, i.e., not from scratch? Point to ponder! Closing Remarks 1/2 Newman also proposed other bases for defining edge betweenness. Electrical current flow through the edge where every edge is viewed as unit resistance and we consider all sourcesink pairs. Based on random walks. Both less effective and more expensive than geodesics (see paper for details). What about directed and weighted cases? Closing Remarks 2/2 Goodness

metric of community division. Helpful when we dont know the ground truth. Q = i (eii ai2 ), where Ekxk= matrix of community division: eij = fraction of edges linking comm. i to comm. j; ai = j eij . Q measures fraction of intra-comm. edges over what is expected by chance (assuming uniform distribution). See paper for details of experimental results. Turns out study of influence/information propagation can suggest new ways of detecting communities: will revisit this issue after we study influence propagation. Recommended Reading J. Ruan and W. Zhang. An Efcient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks. ICDM 2007. M. E. J. Newman "Modularity and community structure in networks", physics/0602124 = Proceedings of the National Academy of Sciences (USA) 103 (2006):

875778582. Jure Leskovec, Kevin J. Lang, and Michael W. Mahoney. Empirical Comparison of Algorithms for Network Community Detection. WWW 2010. M. E. J. Newman. Communities, modules and largescale structure in networks. Nature Physics 8, 2531 (2012) doi:10.1038/nphys2162 Received 23 September 2011 Accepted 04 November 2011 Published online 22 December 2011.

Recently Viewed Presentations

  • Evidence and Elaboration - English Language Arts with Mr ...

    Evidence and Elaboration - English Language Arts with Mr ...

    Practice, Practice, Practice! YOUR TASK: Select TWO pieces of evidence below and practice your elaborative techniques in a blank word document. Before moving to the next slide, let me check your practice portion. (You may change the wording, the signal...
  • Introduction to Profibus Michael Bryant Executive Director Profibus

    Introduction to Profibus Michael Bryant Executive Director Profibus

    New Carrier (CVN 77) USS Carl Vinson USS Chester A. Nimitz USS Harry S. Truman WWW.PROFIBUS.COM WWW.PROFIBUS.COM WWW.PROFIBUS.COM * * The communications capability of devices and continuous, transparent information are indispensable components of future-oriented automation concepts.
  • Adjectives and Adverbs

    Adjectives and Adverbs

    Arial Calibri Wingdings Arial Black Office Theme 1_Office Theme PowerPoint Presentation Adjectives and Adverbs PowerPoint Presentation An adjective and/or adverb item on an objective test might look like this . . . Sample Item Adjectives describe nouns. Adverbs modify verbs,...
  • Folie 1 - uni-luebeck.de

    Folie 1 - uni-luebeck.de

    Slidesfrom AIMA bookprovidedby Cristina Conati, UBC. Data Mining Bayesian Networks. Full Bayesian Learning. MAP learning. Maximum Likelihood Learning. Learning Bayesian Networks. Fully observable. With hidden (unobservable) variables. Full Bayesian Learning.
  • The Origins of Life and Precambrian Evolution

    The Origins of Life and Precambrian Evolution

    The Origins of Life and Precambrian Evolution Chapter 16 Questions What was the first living thing? Where did it come from? What was the last common ancestor of today's organisms and when did it live?
  • Challenging Occupational Choice Theory  enabling students to take

    Challenging Occupational Choice Theory enabling students to take

    Challenging Occupational Choice Theory - enabling students to take control of their destinyMark Stow Head of Careers & EmployabilityRachal LilleyStudent Employment Co-ordinator . To give a brief appraisal of some of the common theories of Occupational Choice. To identify the...
  • Feelings, Faces and Food: Mentalization in BPD and ED A ...

    Feelings, Faces and Food: Mentalization in BPD and ED A ...

    INTRO Research on mentalization in BPD and ED Mentalization lower in BPD and ED (Fonagy et al, 1996) Resilience -Capacity to mentalize can mediate effects of childhood abuse (Fonagy et al, submitted) Mentalisation Based Therapy effective for BPD - (Bateman...
  • Building an Institutional Research Repository from the Ground Up:

    Building an Institutional Research Repository from the Ground Up:

    Building an Institutional Research Repository from the Ground Up: The ARROW Experience Status Snapshot as of September 2004 (pre-Bandicoot) Dr Andrew Treloar