Thursday, December 6, 2012

Almost stateless aggregate

Recently, I had to aggregate the 3rd column of a huge file by groups of the first and second column. Now you'll find many examples on the web on how to aggregate a file by groups. However, they usually construct an associative array along the way and print the array at once in the end. But what happens when you run out of memory before you can dump the array to disk?

One solution

One solution is to order the file first - such that the groups appear in blocks, rather than spread across the complete file. This permits a stateless implementation of an aggregate which I'll show below using "awk". Of course, you could use any other tool just as well.

Doing it the *nix way

If you know your way around "sort", it's quite easy to re-order the lines in the file in such a way that the result is grouped by the first two columns:

  sort -k1,2 <input>

This will turn a list like this:

  1,2,10
  4,8,11
  1,4,12
  1,2,13
  4,8,14

into a grouped list:

  1,2,10
  1,2,13
  1,4,12
  4,8,11
  4,8,14


All you need now is this "awk"-script which does the aggregation on such a sorted file without consuming more than what is need to store an int in an associative array.

  # executed before everything else
  # tells awk, that we're dealing with
  # a CSV-file
  BEGIN {
    FS = ","
  }

  # executed on every line of the file
  {
    if (!(($1 "," $2) in current) && NR > 1) {
      for (s in current) {
        print s "," current[s]
        delete current[s]
      }
    }

    current[$1 "," $2] += $3
  }

  # executed after the last line has been read
  END {
    for (s in current) {
      print s "," current[s]
    }
  }

The key is that the script uses the first and second columns of the file as keys for the associative array: "current[$1 "," $2]" therefore points to the aggregated value of the third column for the current group. Once we find a combination of "[$1 "," $2]" which does not exist yet (i.e. a new group), all entries in the array are printed (there is always just one) and deleted. Then the new group is added and so on. The last group is printed when we reach the "END" part of the script.