How the west was clustered

By Nathan Donaldson in Development on October 27, 2009

Image 0500 full 2x
Screen shot 2009-10-27 at 9.20.25 AM
A cluster on IntuitionHQ

Our new product, IntuitionHQ, shows clusters of clicks on an image. To generate these clusters we made use of a gem called Hierclust. The great thing about this gem is it’s simplicity – just input the points and a minimum cluster separation, and out come the clusters.

The problem with Hierclust was the performance. With fewer than 100 points to cluster Hierclust was running too slow to do it dynamically. This was no problem, we moved the clustering program into a cronjob and stored the data in a marshalled file.

However, in testing we found that Hierclust was still too slow. Once we had over 200 points being clustered it started taking minutes to process – an unsustainable amount of time for the data we expected. The graph below shows the timings, which I believe is O(n3). We had to disable cluster processing while looking at the problem due to issues it was causing on the server.

Screen shot 2009-10-27 at 10.14.36 AM
Graph of points v time taken

ai4r

The first course of action was to look for an alternative way to do the clustering. Looking at several solutions, the most promising looked to be the gem ai4r. This provided a number of different clustering algorithms, and was faster / more scalable. Unfortunately the clustering algorithms needed to know the number of clusters – information that we didn’t want to estimate.

Documentation

It’s important to always read the documentation of gems you’re using. After going back to Hierclust and looking through the code I found a value I could tweak to improve the performance at the cost of accuracy. This bought us some time by reducing the processing time of our current data from minutes to 10-20 seconds. It wasn’t going to scale, but we could re-enable the cronjob.

ruby-prof

We then decided that if we could improve the performance of the Hierclust gem itself we didn’t need to worry that it was taking so much longer for each point, as there would be a natural limit to the number of points viable for a project. Using Benchmark and ruby-prof. I started to go through the gem with a fine tooth comb. I had test data with 250 points, taking 11 seconds to run. You can take a look at my test file here: http://gist.github.com/219094.

Profiling the code showed that it was spending a lot of time flattening arrays. Looking at every place where flatten was called I found that where each cluster contained an array of points, each of which could be a cluster, flatten was called. By simply changing the code to store the flattened array in an instance variable (invalidating when required) the time to run the test data went from 11 seconds to 3 seconds, a nearly 4 times improvement.

 %self     total     self     wait    child    calls  name
 15.31      9.02     9.02     0.00     0.00  4404288  Hierclust::Point#points
 12.41     35.59     7.31     0.00    28.27   920825  Array#map-1
 12.31     30.76     7.26     0.00    23.50  1060120  Array#map
 12.16      7.17     7.17     0.00     0.00  1980941  Array#flatten
  6.94     26.08     4.09     0.00    21.99   530058  Hierclust::Cluster#x
  6.77     38.33     3.99     0.00    34.34  1060116  Hierclust::Cluster#points
  6.74     25.28     3.97     0.00    21.30   530058  Hierclust::Cluster#y
  6.31     56.69     3.72     0.00    52.97   265029  Hierclust::Point#distance_to

Table lookup

A 4 times improvement was good, but we needed something closer to a 10 times improvement. I couldn’t find anywhere else in the code to improve the performance. However, we did note that quite a lot of time was spent calculating the distance between points (a Pythagoras calculation). Lookup tables are faster, we thought, so why not pre-generate possible values and load them from disk. This was quickly discounted when some quick maths showed us it would require gigabytes of space; probably not faster then.

RubyInline

The next step was to look at throwing some C code into the gem to improve the performance. I tried recoding various methods using RubyInline, a brilliant gem that makes it easy to add small snippets of C code to ruby. For example, the following ruby code:

def distance_to(other)
  Math.sqrt((other.x - self.x) ** 2 + (other.y - self.y) ** 2)
end

Can be rewritten in the same file in C:

inline do |builder|
  builder.include '
'
  builder.c "
    double pyth(double x1, double y1, double x2, double y2) {
      double rr = pow((long)x2 - (long)x1, 2) + pow((long)y2 - (long)y1, 2);
      return sqrt(rr);
  }"
end

Unfortunately the speed benefits were very small, 3 seconds to 2.5 seconds, or 4.5 times faster than the original 11 seconds. You can see my efforts at patching Hierclust here: http://github.com/jemmyw/hierclust.

C Extension

I reasoned from the data I got from ruby-prof that the main bottleneck in the gem, apart from the distance calculation, was in doing array operations. Using RubyInline wasn’t going to solve that. So, using my rusty knowledge of C and this PDF I set about rewriting the Hierclust algorithm in C. As we didn’t require hierarchical clustering I dropped that part of the algorithm and just stored the cluster information with coordinates and cluster size. I used the previous test data of 250 points during development to ensure that the output from the C extension was exactly the same as the output from Hierclust, and once complete I ported Hierclust’s spec files to the new gem.

The results were well worth the effort. The same set of 250 points that took 11 seconds with Hierclust and 3 seconds after my modifications, took ~0.009 seconds using the new C based gem, 1000 times faster. The test file required only changing three lines (class names and requires).

The fastcluster gem can be downloaded:

gem install fastcluster

Or you can get the source code from http://github.com/jemmyw/fastcluster.

Test files for Hierclust and fastcluster: http://gist.github.com/219094