Until now I've always used randomly generated graphs which was completely fine but I've always wanted to use a real graph at some point. And since I'm gonna be sitting in a train for a couple of hours and have a thing for social graphs, I thought it'd be nice to look at what information we can extract from Facebook's likes. And if that's not reason enough - this is my first post which is going to be linked from the famous r-bloggers web site.
Introduction
Facebook provides access to the "who likes whom" graph through its graph-api which can be queried through a REST interface: "https://graph.facebook.com/---some-object-id---/likes?access_token=---your-access-token---". The semantics of the returned data depends on the "object-id". In our case, we are interested in public pages. So if we query the interface for - e.g. "Blogf.social", we receive a list of pages which like Blogf. Obviously, we can then recursively get the pages which like those pages and so on. That's essentially a breadth-first search on the like graph of Facebook. So what are we going to find? Blogf is a community of of German speaking female bloggers assembled under the slogan "women write better blogs". Regardless of whether that's true or not - it's an interesting mix of people and topics. In any case, pages which like Blogf would be expected to be somewhat related to women.
Ego Graph
The following graph shows the result after a breadth-first search of depth 3, starting at Blogf. Vertices are pages and edges represent likes - i.e. "A -> B" means "A likes B". Green vertices have substantially fewer likes than red vertices. The vertex- and label-size encode the same information. I also removed the vertices for which we didn't query the likes (because we reached the maximum depth of 3). Blogf is centered in the graph with several clusters of pages around it. What is interesting is that while most of the pages are about women or online media, they cluster into different communities. On the top-left, for example, there is a cluster of pages about the environment connecting through "Netzfrauen (english: Netwomen)" to Blogf. Likewise, to the right, there is a more political cluster unsurprisingly connected through "Gender equality" to Blogf. Interestingly, however, this seems to be a completely different kind of political sites (with an emphasis on human rights) than the crowd on the bottom right (with an emphasis on left-wing politics), connected through "Wer braucht Feminismus (Who needs feminism)".
In the literature, such graphs are often referred to as "ego-graphs" because they represent the network as seen by an individual entity. It has been shown that by using such graphs to detect communities, the result is closer to the expected communities than when a complete graph is used (see: "DEMON: a local-first discovery method for overlapping communities"). However, I didn't want to write yet another article about community detection. My goal was something else: Can we use this graph and group the pages by interest?
To this end we apply a technique that I came across in a security paper: It's a projection of the graph in which vertices are connected by undirected edges if they like the same page. In "Intrusion as (anti)social communication: characterization and detection", the authors use this technique to identify clusters of source IP addresses which talk to same target IP addresses. Most of the IP addresses are then assumed to be embedded within dense clusters but some are connected to many clusters which the authors define as intrusion (or, in their words: "Entering a community to which one does not belong"). The result for our "who likes whom" or "common interest" graph is shown below.
Common Interests Graph
To this end we apply a technique that I came across in a security paper: It's a projection of the graph in which vertices are connected by undirected edges if they like the same page. In "Intrusion as (anti)social communication: characterization and detection", the authors use this technique to identify clusters of source IP addresses which talk to same target IP addresses. Most of the IP addresses are then assumed to be embedded within dense clusters but some are connected to many clusters which the authors define as intrusion (or, in their words: "Entering a community to which one does not belong"). The result for our "who likes whom" or "common interest" graph is shown below.
To get an intuitive understanding of the graph, consider the vertices to be people standing in a room talking with their neighbors. Some are part of tight groups (in fact, they are completely connected, forming a clique), some are literally caught between the cracks, talking to several groups at once. Finally, the topic they talk about is the page they like (as denoted by the green labels on the edges). Now it becomes obvious that the graph from before indeed contains a diverse assortment of interests. The large cluster on the top is the set of pages which like "Wer braucht Feminismus (Who needs feminism)". Interestingly, this cluster is disconnected. This means that "Wer braucht Feminismus (Who needs feminism)" is not liked by any other page in the Blogf's ego graph. Also, it is composed of a quite diverse set of pages - including political sites (such as the left-wing party "Die Linke"), several pages from federal states to sites clearly targeted towards women. There is one other large much more homogeneous cluster which is also disconnected: The set of pages which like "Gender Equality Germany", mostly composed of human-rights oriented pages. On the top right we find a community of start-up and business oriented pages and the set of pages which like "Blogf" can be found at the center of the graph. It is well connected to the rest of the graph. Interesting are the pages through which different clusters connect - such as "Gender Equality Germany" because those pages connect different communities through their diverse interests. As such, they are primary targets for closer collaboration (such as exchanging likes, cross-linking and so on).
Conclusion
For me it was quite interesting to see the different topics that circle around a page like Blogf. For Blogf, this does not only mean that there are several pages which still don't know about or like Blogf (or which, as a matter of fact should be liked by Blogf) but it also gives a hint at which kind of topics people that like Blogf are interested in - i.e. it could help the editor to select the right blog-posts from the blogosphere. If you like to run the same analysis for your site, you can find the code here. The crawler which queries the likes from facebook is written in python, the plotting and projection tasks are done in R.
Update (04/28/2014)
Paul Sorenson from metrak.com provided some improvements to the code. Not only can you now use the command line to specify which pages you want to crawl and up to which depth. The real improvement is that the python script now keeps the connection to facebook active - making the crawler run much faster. He also pointed me to the "combn" function in R which does exactly what is needed to construct the projetion - only much faster than my plain R implementation. Finally, he found a bug in my code - I didn't upload the version which actually prints the edge-labels on the common interest graph. Sorry for that! Check out the update on the Bitbucket repository and many thanks to Paul!
No comments:
Post a Comment