Improving performance of classifier gem
In the post acts_as_classifier I mentioned that the Bayesian classifier in the classifier gem needs some work. Since then I have received a few emails asking for more details, so I will briefly list the points that I think could improve it. The current gem is perfectly usable; the features below would simply add value to it and may help improve classification accuracy.
1. It should take prior probability into consideration. For example, one kind of document may be more likely to occur than another; using prior probabilities incorporates this factor into the decision.
2. While classifying, the probability P(x|y) is currently calculated as the number of times the word has appeared in a particular category divided by the total number of words in that category.
Instead of dividing by the total number of words in a category, divide by the number of documents in that category. The latter is a standard technique and tends to give better results.
3. It does not support cost-sensitive learning, which helps in scenarios where some kinds of decisions are more important than others. For example, in preliminary testing, not missing a tumor is more important than avoiding a false report that someone has a tumor.
4. The classify method does not accept a threshold, but this can easily be worked around.
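To make points 1 to 4 concrete, here is a minimal sketch of a naive Bayes classifier written from scratch. It is not the classifier gem's API: the class name TinyBayes, its methods, and the threshold: and costs: options are all hypothetical. I also assume add-one smoothing and a simple lowercase tokenizer, neither of which comes from the gem, and only the words present in the text are scored.

```ruby
# Hypothetical sketch, NOT the classifier gem's API.
class TinyBayes
  def initialize
    @doc_counts = Hash.new(0)                            # category => number of training documents
    @word_docs  = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => documents containing it
  end

  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).uniq.each { |word| @word_docs[category][word] += 1 }
  end

  # Posterior probability of every category, normalised to sum to 1.
  def posteriors(text)
    words = tokenize(text).uniq
    total_docs = @doc_counts.values.sum.to_f
    log_scores = {}
    @doc_counts.each do |category, n_docs|
      # Point 1: start from the prior P(category).
      score = Math.log(n_docs / total_docs)
      # Point 2: P(word|category) uses the number of documents in the
      # category as the denominator (add-one smoothing is an assumption).
      words.each do |word|
        score += Math.log((@word_docs[category][word] + 1.0) / (n_docs + 2.0))
      end
      log_scores[category] = score
    end
    max = log_scores.values.max
    weights = log_scores.transform_values { |s| Math.exp(s - max) }
    norm = weights.values.sum
    weights.transform_values { |w| w / norm }
  end

  # costs maps [actual, decided] to a penalty; missing pairs cost 0.
  def classify(text, threshold: 0.0, costs: nil)
    post = posteriors(text)
    if costs
      # Point 3: choose the decision with the lowest expected cost.
      return post.keys.min_by do |decided|
        post.sum { |actual, prob| prob * costs.fetch([actual, decided], 0.0) }
      end
    end
    # Point 4: refuse to answer when the best posterior is below the threshold.
    best, prob = post.max_by { |_, p| p }
    prob >= threshold ? best : nil
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

nb = TinyBayes.new
nb.train(:spam, "buy cheap pills")
nb.train(:spam, "cheap pills now")
nb.train(:ham,  "meeting at noon")
nb.train(:ham,  "lunch meeting tomorrow")
nb.train(:ham,  "project meeting notes")

nb.classify("cheap pills")                # => :spam
nb.classify("buy lunch")                  # => :spam (a close call, about 0.51)
nb.classify("buy lunch", threshold: 0.9)  # => nil
# Wrongly flagging ham as spam is made ten times as costly as missing spam.
nb.classify("buy lunch", costs: { [:ham, :spam] => 10.0, [:spam, :ham] => 1.0 }) # => :ham
```

The cost table plays the same role as the tumor example: when one kind of mistake is more expensive, the decision shifts away from it even though the posterior probabilities have not changed.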
There are other kinds of fine-tuning that can be done, but they are problem-specific and require some empirical experiments.
But despite all this, the standard disclaimer used in machine learning still applies.
There is something called No Free Lunch: no single technique works best on every problem.