A few days back I came across the Carrot Clustering Framework, and it inspired me to write something similar for Ruby. So I started this project, and so far I have implemented the basic K-Means and Hierarchical Clustering algorithms.
The first release can be downloaded from Rubyforge using the following command:
gem install clusterer
The gem requires the stemmer gem as a dependency.
There are also two example files that show how to use the library by clustering search results returned by Yahoo and Google. To try an example, the corresponding API key is needed.
Basically, you pass an array of strings to the clustering algorithm, and it returns the indices of the elements, grouped by cluster.
Clusterer::Clustering.kmeans_clustering(["hello world","mea culpa","goodbye world"])
Clusterer::Clustering.hierarchical_clustering(["hello world","mea culpa","goodbye world"])
The result might be something like [[1,3],[2]].
The method signature for K-means is as follows:
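To see why the first and third strings would land in the same cluster, here is a minimal sketch of a word-overlap (Jaccard) similarity. This is only an illustration: `word_overlap` is a hypothetical helper, and I am not claiming it is the measure the gem uses by default.

```ruby
# Hypothetical helper (not part of the gem): Jaccard similarity on words.
# Returns the number of shared words divided by the number of distinct words.
def word_overlap(a, b)
  x = a.downcase.split.uniq
  y = b.downcase.split.uniq
  union = x | y
  return 0.0 if union.empty?
  (x & y).size.to_f / union.size
end

docs = ["hello world", "mea culpa", "goodbye world"]
puts word_overlap(docs[0], docs[2]) # shares "world", so the score is above zero
puts word_overlap(docs[0], docs[1]) # no shared words, so the score is 0.0
```

Under a measure like this, "hello world" and "goodbye world" are closer to each other than either is to "mea culpa", which matches the grouping above.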
def kmeans_clustering (docs, k = nil, max_iter = 10, &similarity_function)
K-means is a simple hill-climbing algorithm. It is fast, but it can get stuck at a local optimum. The maximum number of iterations is there to make sure the algorithm stops even if it ends up oscillating between states.
When k is nil, the algorithm finds k = Math.sqrt(docs.size) clusters.
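The loop described above can be sketched roughly as follows. This is not the gem's source: it works on plain numeric points rather than documents, and the Euclidean distance, the choice of the first k points as starting centroids, and rounding sqrt(n) up are all my assumptions for the sake of a short, deterministic example.

```ruby
# Squared Euclidean distance between two points (arrays of numbers).
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Hypothetical sketch of k-means; names and defaults are illustrative only.
def sketch_kmeans(points, k = nil, max_iter = 10)
  k ||= Math.sqrt(points.size).ceil       # assumed rounding of sqrt(n)
  centroids = points.first(k).map(&:dup)  # assumed initialisation
  assignment = []

  max_iter.times do
    # Assign every point to its nearest centroid.
    new_assignment = points.map do |p|
      (0...k).min_by { |c| squared_distance(p, centroids[c]) }
    end
    # Converged: nothing changed. Otherwise max_iter caps the loop.
    break if new_assignment == assignment
    assignment = new_assignment

    # Move each centroid to the mean of its members.
    k.times do |c|
      members = points.each_index.select { |i| assignment[i] == c }.map { |i| points[i] }
      next if members.empty? # keep the old centroid for an empty cluster
      centroids[c] = members.transpose.map { |dim| dim.sum / dim.size.to_f }
    end
  end

  # Clusters as arrays of 0-based indices into points.
  (0...k).map { |c| points.each_index.select { |i| assignment[i] == c } }
end

points = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
p sketch_kmeans(points) # k defaults to 2 for four points
```

Because the starting centroids determine where the climb ends up, a different initialisation can give a different (and possibly worse) grouping, which is the local-optimum problem mentioned above.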
The method signature for hierarchical clustering is as follows:
def hierarchical_clustering (docs, k = nil, &similarity_function)
Hierarchical clustering gives much better results, but it is comparatively slow when the data volume is high.
If you are using this gem on a live, public-facing site, let me know; I would like to link to it.
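To give a feel for where the extra cost comes from, here is a rough agglomerative sketch. Again, this is an assumption-laden illustration and not the gem's implementation: `sketch_hierarchical` is a hypothetical name, it uses single linkage, and the two-argument similarity block is my guess at a convenient shape rather than the gem's documented contract.

```ruby
# Hypothetical sketch: start with one cluster per document, then repeatedly
# merge the two most similar clusters until only k clusters remain.
# The block receives two documents and returns a similarity score.
def sketch_hierarchical(docs, k)
  clusters = docs.each_index.map { |i| [i] }

  while clusters.size > k
    # Every pair of clusters is compared on every pass; this is the slow part.
    a, b = clusters.combination(2).max_by do |left, right|
      # Single linkage: score a pair by its most similar members.
      left.product(right).map { |i, j| yield(docs[i], docs[j]) }.max
    end
    clusters = clusters - [a, b] + [(a + b).sort]
  end

  clusters.sort # arrays of 0-based indices into docs
end

docs = ["hello world", "mea culpa", "goodbye world"]
result = sketch_hierarchical(docs, 2) { |x, y| (x.split & y.split).size }
p result # => [[0, 2], [1]]
```

Each merge step looks at all remaining pairs of clusters, so the work grows much faster with the number of documents than the per-iteration cost of K-means does.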
Update: New release Clusterer + other plugins