Thursday, December 13, 2012

Community detection algorithm with igraph and R - (2)

In the last post I presented a slightly modified LPA algorithm from the igraph wiki. This algorithm needed around 40s for 10,000,000 edges and 1000 unique vertices. I promised, that one could do much better. Here you go!

Iterators in igraph

To understand why this is possible, have a look at this quote from the igraph help:
"A note about the performance of the V and E functions, and the selection of edges and vertices. Since all selectors are evaluated as logical vectors on all vertices/edges, their performance is bad on large graphs. (Time complexity is proportional to the total number of vertices/edges.) We suggest using the neighborsincident functions and simple R vector operations for manipulating vertex/edge sequences in large graphs."
So these functions have a bad performance - but can we avoid them in our algorithm and, if yes, how much benefit will we get?

Iteratorless LPA

If you look at the algorithm again, you'll notice that the iterator-function "V" is used to access the "group" property of the nodes. The problem is, that the "group" property is accessed in the inner-most loop of the algorithm. This means that the overhead of each call to "V" is multiplied by how often we visit the inner-most loop - i.e. |V|.
The solution is not to save the group property in the igraph-object, but to save it in a regular (and fast) R vector. It's really as easy as that because the indices in the vector are the same as the indices in the iterator - i.e.: "V(g)[i]$group == vgroups[i]". This yields in the following algorithm (altered lines are bold):

  LargeScaleCommunityFast <- function(g, mode="all") {
    cat("Assigning initial communities...\n")
    vgroups <- V(g)$name
    # random order in which vertices will be processed
    cat("Generating random order...\n")
    order <- sample(vcount(g), vcount(g))
    t <- 0
    done <- FALSE

    while (!done) {

      t <- t + 1
      cat("round: ", t, "\n")

      ## change to FALSE whenever a node changes groups

      done <- TRUE

      for(i in order) {

        ## get the neighbor group frequencies:
        group.freq <- table(vgroups[neighbors(g, i, mode = mode)])
        ## pick one of the most frequent:
        new.group <- sample(names(group.freq)[group.freq == max(group.freq)], 1)
        if (done) {
          ## we are only done if new group is the same as old group
          done <- (new.group == vgroups[i])
        }
        vgroups[i] <- new.group
      }
    }

    cat("Creating community-object...\n")

    comms <- list(membership = as.numeric(vgroups),
                  vcount = vcount(g),
                  algorithm = "LPA",
                  names = V(g)$name)
    class(comms) <- "communities"
    return(comms)
}

Executing the algorithm

Running this algorithm against the same input as used in the previous post results in a runtime of 0.7s for 1,000,000 edges and 5s for 10,000,000 edges. This is almost 10 times faster than the previous (7s, 40s).
You can do the same with properties of edges (like weights) - this is one of the things I'd like to write about next.