Shilad, Dan, and I wrote a short paper called “Supporting Social Recommendations with Activity-Balanced Clustering”. I presented the paper at Recommender Systems 2007 here in Minneapolis. The idea is that standard clustering algorithms such as k-means simply don’t work well if you’re trying to create groups of users that manifest similar levels of activity. In MovieLens, k-means placed 84% of the active users in a single cluster. Thus, we mashed up some techniques from the clustering literature to create an algorithm that improved on k-means in terms of cluster quality and computational complexity, while giving us the balanced properties we sought. One of the key insights is that stable matching algorithms are a clever way to evenly divide sets of things based on similarity metrics.
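To make that last insight concrete, here is a toy sketch of the idea (not the algorithm from the paper): users propose to clusters in order of similarity, hospitals/residents style, and a cluster that exceeds its fixed capacity bumps its least-similar member. The `balanced_assign` function and its inputs are made up for illustration.

```python
def balanced_assign(similarity, n_clusters):
    """Stable-matching-style balanced assignment (toy sketch).

    similarity: dict mapping user -> list of (cluster, score) pairs,
    one pair per cluster, higher score = more similar.
    Every cluster ends up with at most ceil(n_users / n_clusters) users.
    """
    users = list(similarity)
    capacity = -(-len(users) // n_clusters)  # ceil(n / k)
    # Each user's clusters, most similar first.
    prefs = {u: [c for c, _ in sorted(sims, key=lambda cs: -cs[1])]
             for u, sims in similarity.items()}
    score = {u: dict(sims) for u, sims in similarity.items()}
    next_choice = {u: 0 for u in users}
    members = {c: [] for c in range(n_clusters)}
    free = list(users)
    while free:
        u = free.pop()
        c = prefs[u][next_choice[u]]  # u's best cluster not yet tried
        next_choice[u] += 1
        members[c].append(u)
        if len(members[c]) > capacity:
            # Over capacity: bump the member least similar to this cluster.
            worst = min(members[c], key=lambda v: score[v][c])
            members[c].remove(worst)
            free.append(worst)
    return members

# Deterministic toy similarity scores: 12 users, 3 clusters.
sims = {u: [(c, (u * 7 + c * 3) % 11) for c in range(3)] for u in range(12)}
groups = balanced_assign(sims, 3)
# Capacity is ceil(12 / 3) = 4 and all 12 users get placed,
# so each of the 3 clusters holds exactly 4 users.
```

Because total capacity is at least the number of users, a user can never be bumped from every cluster, so the proposal loop always terminates.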
I received lots of feedback from conference attendees. One of the most interesting questions (that I have not investigated) is whether there are some unique properties of the users that I clustered that led to the severe lack of balance with k-means. I don’t think so – the 20,000 or so MovieLens users that we used as our data have two characteristics that seem particularly regular:
- Power-law activity distribution
- Normal distribution of ratings (1-5 scale)
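The power-law characteristic alone is enough to reproduce the imbalance on made-up data. Below is a quick synthetic illustration (this is not the MovieLens experiment): plain 1-D k-means run on Pareto-distributed activity counts dumps the large mass of low-activity users into a single cluster. The `kmeans_1d` helper and the Pareto parameters are my own assumptions for the demo.

```python
import random

def kmeans_1d(xs, k, iters=20):
    """Plain 1-D k-means (Lloyd's algorithm): assign each point to its
    nearest center, recompute centers as cluster means, repeat."""
    centers = random.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Keep a center unchanged if its cluster emptied out.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

random.seed(0)
# Synthetic power-law activity: most users rate a handful of items,
# a long tail rates hundreds.
activity = [int(random.paretovariate(1.2)) for _ in range(5000)]
groups = kmeans_1d(activity, 5)
sizes = sorted((len(g) for g in groups), reverse=True)
# The biggest cluster swallows well over half the users, because all the
# low-activity users sit nearest the same center.
```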
Another person suggested that I simply hadn’t tuned k-means (and the other algorithms I tried) well enough. Well, I turned plenty of knobs, many times. In the process of all that knob turning, I realized that clustering metrics are a fairly bad way of gauging the “feel” of a cluster of users. I’d like to dig more into clustering metrics that reflect the different ways that groups of people might come together.