Small World: communities

Showing posts with label communities. Show all posts

Tuesday, February 26, 2013

The Cost of Convenience

The previous posts were all about saving processing time with R. However, a question that came to my mind was: How fast would it be in C++? Or, in other words: What does the convenience of using R cost? Given, that we consider really large data, this might be an important question to answer before the implementation is started.

Executive Summary

Last time we reached 7s runtime for 10,000,000 edges with the optimized version of LPA on top of R and IGraph. Let's compare that to a simple C++ program - which does nothing but LPA. In a way that is very unfair to R/IGraph. Naturally, they loose performance by offering a more general interface which does much, much more than just LPA or graph processing. But in some cases, it might indeed pay off to implement a specialized function once and have it run many times on large data-sets in a fraction of the time.

If you don't wanna read through the details, here is the executive summary: The C++ LPA finishes in about 0.6s. Note, that it is in no way optimized, other than the '-O3', offered by the compiler. This means its more than 11 times faster than its R counter part! However, as you'll see below the code is longer and quite specialized.

Implementation Details

Let's start with the preamble - the data-structures. We use a map to describe the graph. Intuitively, the keys are vertices and the values are the neighbors. However, we need to store some more information - e.g. the groups. There are, in fact, many options to do that. I chose not to rely on the ordering of the map (i.e. by using a "groups" vector like we used in the R counterpart which shares the index of the map).

Instead, the values in my map are somewhat "fat": They include the group of the vertex, the list of neighbors and a buffer for proposals. The latter is needed because I chose to implement LPA in a two-way manner: First, there is a proposal-phase where every vertex proposes its own group to its neighbors. Then, every vertex evaluates the received proposals and chooses a (possibly) new community, based on that.

#include <map>

#include <vector>

#include <algorithm>

#include <string>

#include <iostream>

#include <fstream>

#include <stdint.h>

#include <ctime>

// contains the neighbors, all proposals and the group of a vertex

struct info {

std::vector<uint64_t> neighbors;

std::map<uint64_t, uint64_t> proposals;

uint64_t group;

};

// the actual graph, implemented as a map

typedef std::map<uint64_t, struct info> GraphT;

GraphT G;

// we use this to obtain the most fequent group in our neighborhood

bool value_comparer(std::map<uint64_t, uint64_t>::value_type &i1,

std::map<uint64_t, uint64_t>::value_type &i2)

{

return i1.second < i2.second;

}

This is the function we use to assign a unique group to all vertices initially, using the vertex names as group. It is actually so boring that we can just skip it ;)

// assign unique groups initially

void initial_groups() {

// run over all vertices and assign unique groups

for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {

v->second.group = v->first;

}

Here, it gets interesting. As noted above, this algorithm has two phases (1: proposal phase, 2: decision phase) which are repeated until no more vertices change their group. It is in no way surprising that this reminds of consensus: LPA is in fact thriving to reach a consensus of densely connected clusters in a graph.

// the actual algorithm

void run_LPA() {

std::clock_t begin = std::clock();

while(true) {

std::cout << "proposal phase\n";

bool done = true;

// run over all vertices and send proposals

for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {

for (std::vector<uint64_t>::const_iterator n = v->second.neighbors.begin();

n != v->second.neighbors.end();

++n)

{

G[*n].proposals[v->second.group]++;

}

std::cout << "decision phase\n";

// run over all vertices and evaluate proposals

for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {

if (v->second.proposals.size() == 0) continue;

uint64_t new_grp = std::max_element(v->second.proposals.begin(),

v->second.proposals.end(),

value_comparer)->first;

if (new_grp != v->second.group) done = false;

v->second.group = new_grp;

v->second.proposals.clear();

}

if (done) break;

}

std::clock_t end = std::clock();

double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;

std::cout << "Elapsed time: " << elapsed_secs << "s\n";

}

Loading the graph and running the algorithm.

// main program: read file, run LPA

int main(int argc, char* argv[]) {

// read the file

std::ifstream graph;

uint64_t src;

uint64_t dst;

graph.open(argv[1]);

if (graph.is_open()) {

while(graph.good()) {

graph >> src >> dst;

if (graph.eof()) break;

G[src].neighbors.push_back(dst);

}

graph.close();

}

// run the algorithm

initial_groups();

run_LPA();

}

Final Remarks

Let me point out some final thoughts and observations. First, I do not even think that this is the most efficient code out there. For example, STL maps are not exactly known for their performance. I'd bet that a plain C program would even be faster. Moreover, if pursued seriously, one would look around the corner at things like the boost graph library or GraphChi as well.

Thursday, December 13, 2012

Community detection algorithm with igraph and R - (2)

In the last post I presented a slightly modified LPA algorithm from the igraph wiki. This algorithm needed around 40s for 10,000,000 edges and 1000 unique vertices. I promised, that one could do much better. Here you go!

Iterators in igraph

To understand why this is possible, have a look at this quote from the igraph help:

"A note about the performance of the V and E functions, and the selection of edges and vertices. Since all selectors are evaluated as logical vectors on all vertices/edges, their performance is bad on large graphs. (Time complexity is proportional to the total number of vertices/edges.) We suggest using the neighbors, incident functions and simple R vector operations for manipulating vertex/edge sequences in large graphs."

So these functions have a bad performance - but can we avoid them in our algorithm and, if yes, how much benefit will we get?

Iteratorless LPA

If you look at the algorithm again, you'll notice that the iterator-function "V" is used to access the "group" property of the nodes. The problem is, that the "group" property is accessed in the inner-most loop of the algorithm. This means that the overhead of each call to "V" is multiplied by how often we visit the inner-most loop - i.e. |V|.

The solution is not to save the group property in the igraph-object, but to save it in a regular (and fast) R vector. It's really as easy as that because the indices in the vector are the same as the indices in the iterator - i.e.: "V(g)[i]$group == vgroups[i]". This yields in the following algorithm (altered lines are bold):

LargeScaleCommunityFast <- function(g, mode="all") {
cat("Assigning initial communities...\n")
vgroups <- V(g)$name
# random order in which vertices will be processed
cat("Generating random order...\n")
order <- sample(vcount(g), vcount(g))
t <- 0
done <- FALSE

while (!done) {
t <- t + 1
cat("round: ", t, "\n")

## change to FALSE whenever a node changes groups
done <- TRUE

for(i in order) {
## get the neighbor group frequencies:
group.freq <- table(vgroups[neighbors(g, i, mode = mode)])
## pick one of the most frequent:
new.group <- sample(names(group.freq)[group.freq == max(group.freq)], 1)
if (done) {
## we are only done if new group is the same as old group
done <- (new.group == vgroups[i])
}
vgroups[i] <- new.group
}
}

cat("Creating community-object...\n")
comms <- list(membership = as.numeric(vgroups),
vcount = vcount(g),
algorithm = "LPA",
names = V(g)$name)
class(comms) <- "communities"
return(comms)
}

Executing the algorithm

Running this algorithm against the same input as used in the previous post results in a runtime of 0.7s for 1,000,000 edges and 5s for 10,000,000 edges. This is almost 10 times faster than the previous (7s, 40s).

You can do the same with properties of edges (like weights) - this is one of the things I'd like to write about next.

Monday, December 10, 2012

Community detection algorithm with igraph and R - (1)

In the first entry on this blog I gave an example on how to load huge graphs with R. The biggest problem, however, is actually doing something useful with huge graphs. This post is somewhat of a preparation for the next post on iterators in igraph. I thought it would be better to explain the limitations and how to avoid them with a real example: Community detection. igraph actually provides several different community-detection algorithms, which are written in plain C and, therefore, run quite fast. However, it does not provide a label propagation algorithm (LPA) which belongs to the broader spectrum of modularity optimization algorithms.

How does it work?

Essentially, LPA tries to find a consensus of labels in densely connected clusters (communities) of a graph. It does so in several iterations. On the first iteration, every vertex in the graph chooses a unique community. This could be, for example, its own label. Every vertex will then "send" its own label to all its neighbors.

On every subsequent iteration, each vertex counts the labels it received from it's neighbors in the previous iteration and selects the label with the highest count. If this is the second iteration, each label will only be received once (it is as if all neighbors sent their own, unique, id). However, the vertices will select a random label, if there is no clear winner in the counts.

Ultimately, the algorithm terminates if, during the current iteration, no vertex had to change its label.

You can see, that eventually some labels will win over others. If you do not believe it, you are not wrong either: "Resolution limit in community detection", is just one of the many scientific publications on issues with community detection.

An implementation in igraph

There is actually a basic implementation of this algorithm on the igraph-wiki. Here is a slightly adapted version which creates an actual community-object. This way, you can use it in other igraph-algorithms as well (such as create nice plots).

LargeScaleCommunity <- function(g, mode="all") {

cat("Assigning initial communities...\n")

V(g)$group <- V(g)$name

## random order in which vertices will be processed

cat("Generating random order...\n")

order <- sample(vcount(g), vcount(g))

t <- 0

done <- FALSE

while (!done) {

t <- t + 1

cat("round: ", t, "\n")

## change to FALSE whenever a node changes groups

done <- TRUE

for(i in order) {

## get the neighbor group frequencies:

group.freq <- table(V(g)[neighbors(g, i, mode = mode)]$group)

## pick one of the most frequent:

new.group <- sample(names(group.freq)[group.freq == max(group.freq)], 1)

if (done) {

## we are only done if new group is the same as old group

done <- (new.group == V(g)[i]$group)

}

V(g)[i]$group <- new.group

}

cat("Creating community-object...\n")

comms <- list(membership = as.numeric(V(g)$group),

vcount = vcount(g),

algorithm = "LPA",

names = V(g)$name)

class(comms) <- "communities"

return(comms)

}

Input

One obvious application for these algorithms are graphs of social interaction. Since I wanted to make this as independent as possible from external data-sources, we'll have to construct an artificial graph. The interesting thing is that their structure differs only in one important point from purely random graphs: All social graphs are "scale-free". This means that very few vertices will have the majority of the edges and the majority of the vertices will have very few edges. It is actually easy to see why that is the case: A famous artist will have many connection (e.g. followers on Twitter or g+). But the ordinary person will have only a few.

Generating such a graph in R is quite easy:

cat("--- Creating data.frame...\n")
start <- proc.time()
df <- data.frame(src = sample(1:100, 1000000, replace = TRUE, prob = exp(0.5*(1:100))),
dst = sample(1:100, 1000000, replace = TRUE, prob = exp(0.5*(1:100))))
cat(sprintf("--- elapsed time: %fs\n\n", (proc.time() - start)[1]))

In contrast to the last time, we provide a vector of probabilities for the sample: "prob". This vector is constructed using a "power law" - i.e. from the 100 available vertices it will decide much more often for the 100, than for the 1.

Running the algorithm

Subsequently, we create the graph-object and run the community-detection:

cat("--- Creating graph...\n")
start <- proc.time()
vertex.attrs <- list(name = unique(c(df$src, df$dst)))
edges <- rbind(match(df$src, vertex.attrs$name), match(df$dst, vertex.attrs$name))
G <- graph.empty(n = 0, directed = T)
G <- add.vertices(G, length(vertex.attrs$name), attr = vertex.attrs)
G <- add.edges(G, edges)
cat(sprintf("--- elapsed user-time: %fs\n\n", (proc.time() - start)[1]))

cat("--- Detecting communities...\n")
start <- proc.time()
C <- LargeScaleCommunity(G)

cat(sprintf("--- elapsed time: %f\n\n", (proc.time() - start)[1]))

Running this example on my laptop results in a runtime of around 7s. Not that bad, right? But wait, it's a fairly small graph with only 100 different vertices and 1,000,000 edges. For 1000 unique vertices and 10,000,000 edges it's already around 40s. I won't dare to run this for a real large graph on my notebook. The good news is, however, it can be much faster. But that's material for the next post
.