## Tuesday, December 4, 2012

### Loading huge graphs with igraph and R

igraph is a library for "complex network research". While it integrates very well with R and provides a lot of convenient functions, huge graphs put a quick end to all the joy. The good news is: not all functions in igraph have bad performance. Some are actually amazingly fast - one just needs to avoid the bad ones.

For example, the help-pages of igraph discourage the use of the vertex- and edge-iterators due to their slowness on large graphs (see here, scroll down to "Notes"). However, they neither offer examples on how to avoid their usage, nor are these iterators the only functions with bad performance.
In the remainder of this and the next post, I assembled my solutions to some of the pitfalls I ran into. Of course, these might not be applicable in all use-cases but my hope is that these solutions will be useful for more people than me alone. Also, if you have different suggestions, please leave a comment.

Let us start with creating a random communication log. Let's say you work for a company with 1000 computers and you log all the communication between them. Such a log would probably consist of two columns (in it's easiest incarnation): one column for the computer sending and one column for the computer receiving. Each row would then represent one logged communication. With R, it is very easy to construct a randomized communication log:

df <- data.frame(src = sample(1:1000, 1000000, replace = TRUE),
dst = sample(1:1000, 1000000, replace = TRUE))

This is essentially our edge-list. We assume a directed communication. For example, in my case the first line of "df" looks like this:

> df[1,]
src dst
1 534 966

Which means that computer "534" sent something to computer "966".

### The Problem

Now, let's try to use the data-frame to create a graph object from it. igraph provides a great convenience function: "graph.data.frame". It takes a data-frame as argument and returns a new igraph-object ... perfect - or not? Let's try:

library(igraph)

cat("--- Creating data.frame... ")
start <- proc.time()
df <- data.frame(src = sample(1:1000, 10000000, replace = TRUE),
dst = sample(1:1000, 10000000, replace = TRUE))

cat(sprintf("--- elapsed time: %fs ", (proc.time() - start)))
cat("--- Creating graph... ")
start <- proc.time()
G <- graph.data.frame(df, directed = TRUE)
cat(sprintf("--- elapsed user-time: %fs ", (proc.time() - start)))

The script not only carries out the computation but also measures the user-time for each of the steps. If I execute this on my laptop, I get the following results: 0.39s for creating the data-frame and 17.2s for creating the graph. The problem is that the runtime of "graph.data.frame" grows linearly with the size of the input data-frame (i.e. one million edges result in 1.7s for creating the graph). Note, that 10 million communications is neither uncommon, nor the maximum that you might have to deal with: One billion is more in the range of "Big Graphs" and this means already 30 minutes of runtime for converting the data-frame into an igraph-object.

Even worse: the vertex-ids are converted into characters. This means that "V(G)" will return a list of characters and not a list of numeric values (which is what we have in our data-frame). This can be very inconvenient if, for example, your computer-ids have a meaning: Let's say that ids 0-99 belong to computers located on the ground floor, ids 100-199 belong to computers located on the first floor and so on. Maybe, then you would like to select all the vertices of the graph which belong to computers on the ground floor with: "V(G)[V(G)\$name < 99]". This is now impossible - since the expression "... < 99" expects "..." to be a numeric or integer - however, "V(G)\$name" returns a vector of characters. You can use "grep" in this case, but since you're matching on strings, the runtime will be much higher.

### The Solution

There is actually a solution to both problems: the runtime and the conversion to characters: don't use "graph.data.frame" at all. The following code is mainly what the implementation of "graph.data.frame" does - modulo some of the stuff that is not needed here because we know exactly how our data-frame looks like:

cat("--- Creating graph... ")
start <- proc.time()

vertex.attrs <- list(name = unique(c(df\$src, df\$dst)))
edges <- rbind(match(df\$src, vertex.attrs\$name),
match(df\$dst,vertex.attrs\$name))

G <- graph.empty(n = 0, directed = T)
G <- add.vertices(G, length(vertex.attrs\$name), attr = vertex.attrs)
G <- add.edges(G, edges)

remove(edges)
remove(vertex.attrs)

cat(sprintf("--- elapsed user-time: %fs ", (proc.time() - start)))

First, we create a list of the vertex-names (this is where the implementation of "graph.data.frame" transforms the names into characters). Then we create a list of edges: for each edge (= one row in our data-frame), we search the index of the "src" and "dst" in "vertex-attrs\$name" (using "match"): in the resulting igraph object, the vertex-ids will be equal to the index of the corresponding entry in "vertex.attrs\$name". The edges are composed of source- and destination-vertex-id. Note, that "vertex.attrs" is a bit of an over-kill here, because we do not need it to be a list: It only contains one entry - "name". However, this way it is more portable to other use-cases where you might want to have additional attributes.

Then we're almost done: create an empty graph and add vertices and edges. Since we don't need the two lists anymore, we can free up some memory by removing them. New runtime: 6.5s and the convenience of "V(G)[V(G)\$name < 99]" actually working.

The next blog post is going to be on how to avoid the use of the vertex- and edge-iterators.