Smorgasbord - Politics, Lisp, Rails, Fencing, etc.: May 2006

Sunday, May 28, 2006

ICML 2006 - accepted papers

The PDF copies of the accepted papers are now available. Check out paper no. 107.

Friday, May 19, 2006

Improving performance of classifier gem

In the post acts_as_classifiable I mentioned that the Bayesian classifier in the gem classifier needs some work. After that, I got some emails asking for more details about what it needs, so I will briefly mention some points which I think can improve it. The current gem is absolutely OK; the features below will just add more value to it, and may help improve classification accuracy.

1. It should take prior probability into consideration. For example, a certain kind of document may be more likely to occur than another; using the prior probability incorporates this factor into the decision.

2. While classifying, the probability P(x|y) is currently calculated as the number of times the word has occurred in a particular category divided by the total number of words in that category.
Instead of the total number of words in a category, use the number of documents in that category. The latter is the standard technique and gives better results.

3. It cannot do cost-sensitive learning. This would help in scenarios where some kinds of decisions are more important than others. For example, in preliminary testing, not missing a tumor is more important than avoiding an incorrect result that says someone has a tumor.

4. It cannot use a threshold in the classify method, but this can be easily worked around.
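Points 1, 2 and 4 can be sketched in a few lines of plain Ruby. This is not the classifier gem's code: the class `TinyBayes`, its method names, the add-one smoothing, and the use of a log-score margin as the threshold are all assumptions made for illustration. For brevity it only scores words that are present in the text.

```ruby
# Hypothetical sketch, not the classifier gem's implementation.
class TinyBayes
  def initialize
    @doc_counts = Hash.new(0)                             # category => documents trained
    @word_docs  = Hash.new { |h, k| h[k] = Hash.new(0) }  # category => word => documents containing it
  end

  def train(category, text)
    @doc_counts[category] += 1
    text.downcase.scan(/\w+/).uniq.each { |word| @word_docs[category][word] += 1 }
  end

  # Log score per category: log P(category) + sum of log P(word | category).
  def scores(text)
    total = @doc_counts.values.sum.to_f
    words = text.downcase.scan(/\w+/).uniq
    @doc_counts.each_with_object({}) do |(category, docs), result|
      log_prior = Math.log(docs / total)                  # point 1: prior probability
      log_likelihood = words.sum(0.0) do |word|
        # point 2: divide by documents in the category (add-one smoothing is an assumption)
        Math.log((@word_docs[category][word] + 1.0) / (docs + 2.0))
      end
      result[category] = log_prior + log_likelihood
    end
  end

  # Point 4: return nil unless the best category beats the runner-up by `threshold`.
  def classify(text, threshold = 0.0)
    ranked = scores(text).sort_by { |_category, score| -score }
    return nil if ranked.empty?
    best, runner_up = ranked[0], ranked[1]
    return best[0] if runner_up.nil?
    (best[1] - runner_up[1]) >= threshold ? best[0] : nil
  end
end

bayes = TinyBayes.new
bayes.train(:spam, "cheap pills online")
bayes.train(:spam, "cheap watches")
bayes.train(:legit, "nice post about rails")
p bayes.classify("cheap pills")        # best category
p bayes.classify("cheap pills", 5.0)   # nil when the margin is too small
```

A caller can treat a nil result as "not sure" and fall back to manual review.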

There are other fine-tunings which can be done, but they are problem specific and require some empirical experiments.
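The cost-sensitive idea from point 3 can be illustrated with a minimum expected cost rule: instead of picking the most probable category, pick the decision whose expected cost is lowest. The cost numbers and the names `COSTS` and `min_cost_decision` below are made up for illustration, and the probabilities are assumed to come from some classifier.

```ruby
# Hypothetical sketch: COSTS[actual][decision] is the cost of making
# `decision` when the truth is `actual`. The numbers are invented.
COSTS = {
  tumor:   { tumor: 0.0, healthy: 50.0 },  # missing a tumor is expensive
  healthy: { tumor: 1.0, healthy: 0.0 }    # a false alarm is cheap
}.freeze

# probabilities: { category => P(category | input) }
def min_cost_decision(probabilities, costs = COSTS)
  decisions = costs.values.first.keys
  decisions.min_by do |decision|
    probabilities.sum(0.0) { |actual, prob| prob * costs[actual][decision] }
  end
end

# Even at 10% tumor probability, flagging is the cheaper decision here.
p min_cost_decision({ tumor: 0.1, healthy: 0.9 })
```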

But despite all this, the standard Machine Learning disclaimer applies: there is something called No-free-lunch, so none of these changes is guaranteed to help on every problem.

Tuesday, May 16, 2006

acts_as_classifiable

Consider a scenario where you have a blog and you want to prevent spam comments, but don't want to use captchas, as they are bad from an accessibility standpoint. What is the solution? In steps `acts_as_classifiable'. Use a Bayesian classifier to distinguish between spam and non-spam, and if a comment is flagged as spam, fall back to a captcha-based solution or reject it. Or maybe you want to track the preferences of each user and then make suggestions to them based on that. `acts_as_classifiable' can help in both scenarios and several others. Currently I use it for the web application at Kreeti.com.

To use this plugin, you need to have the gem `classifier' and its dependencies installed. The command below should do it.


gem install classifier --include-dependencies

The plugin itself can be downloaded from
http://opensvn.csie.org/sksinghi/acts_as_classifiable/

Next, your database needs to have a table named `classifier_models'. This is used as a persistent store for the built classifier model.


  create_table :classifier_models, :force => true do |t|
    t.column :identifier, :integer                        # optional key, e.g. a user id
    t.column :classifiable_type, :string, :null => false
    t.column :data, :binary                               # the stored classifier model
  end

Now, to use this plugin in your model, put:


class Comment < ActiveRecord::Base
  acts_as_classifiable :fields => ["text"], :categories => ["Spam", "Legit"]
end

Let us assume that we have an instance of the above model in '@comment'. Then the classifier can be trained by calling the method `train'


@comment.train :legit


@comment.train :spam

It is better to have some additional helper methods in the model which do this for you.
You can also untrain an instance (use this with care) by using


@comment.untrain :spam
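The helper methods suggested above could look something like this. Only `train` and `untrain` come from the plugin; the module `SpamMarking`, its method names, and the stand-in class `FakeComment` (used so the sketch runs outside Rails) are hypothetical.

```ruby
# Hypothetical helpers; they only assume the host object has
# `train` and `untrain` as described in this post.
module SpamMarking
  def mark_as_spam!
    train :spam
  end

  def mark_as_legit!
    train :legit
  end

  # Correct an earlier mistake: forget the wrong label, then learn the right one.
  def reclassify!(from, to)
    untrain from
    train to
  end
end

# Stand-in for the real model; it just records the calls it receives.
class FakeComment
  include SpamMarking
  attr_reader :calls

  def initialize
    @calls = []
  end

  def train(category)
    @calls << [:train, category]
  end

  def untrain(category)
    @calls << [:untrain, category]
  end
end

comment = FakeComment.new
comment.mark_as_spam!
comment.reclassify!(:spam, :legit)
p comment.calls
```

In a real application the module would be included in the `Comment` model instead of the stand-in.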

Bulk training and untraining are also possible:


Comment.train @comments, @classifications

where both @comments and @classifications are arrays, such that @classifications contains the category of each message in @comments.

To use the classifier to classify an instance:


@comment.classify

This will return either "Spam" or "Legit". Bulk classification is also possible.
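The moderation flow described at the top of this post (captcha or rejection only for comments flagged as spam) could be wired up roughly as follows. Only `classify` and its two return values come from the plugin; `moderation_action`, the action symbols, and the `StubComment` stand-in are hypothetical.

```ruby
# Hypothetical moderation step built on the plugin's `classify`.
def moderation_action(comment)
  case comment.classify
  when "Spam"  then :show_captcha     # or :reject, depending on policy
  when "Legit" then :publish
  else :hold_for_review               # defensive default for unexpected results
  end
end

# Stand-in for a classifiable model so the sketch runs outside Rails.
StubComment = Struct.new(:label) do
  def classify
    label
  end
end

p moderation_action(StubComment.new("Spam"))
p moderation_action(StubComment.new("Legit"))
```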

If you want the comment class to have multiple classifiers, one for each user, then all the above methods can be given an additional argument `identifier'.


@comment.train :legit, @user.id

This will create and store a classifier for that particular model, identified by the `identifier'. This can be used when one wants to track the preferences of each user and then make suggestions.

The Bayesian classifier in the gem classifier needs some work, but more about it later.

If you have any questions or issues with this plugin, please post a comment or email me.

Update: New release Clusterer + other plugins

Sunday, May 14, 2006

Share your creativity - Kreeti.com

The Beta site has gone live. The idea is that people can share their jokes, narrations, recipes, and verses with each other, then comment on them and rate them. It is also possible to subscribe to RSS feeds and follow the top or recent items in each category. Also, kreetis (creations) can be assigned different tags and then searched by tag. There are also tag-specific RSS feeds.

Tuesday, May 09, 2006

On Reviewers

If you are new to writing research papers, I suggest first reading this article. A technical paper should meet some standard before it can be considered worth reviewing; otherwise most reviewers will immediately reject it. My reviewing experience is limited to computer science research papers, but I think my analysis here is quite general and should apply to other sciences as well.

Although there are occasional exceptions, most paper reviewers fall into one of the following categories:

Experienced Professor - These people generally know about a wide range of topics, and very likely have an idea about the specific field which the paper addresses. But their knowledge is generally very high level, and they don't go into or understand the details of the paper. Such reviewers pay special attention to the introduction and conclusion of the paper. You should clearly state in these two sections what the contributions of the paper are, and why it should be accepted for publication. They will almost never give a strongly favorable review, and will always find some fault or other: for example, the authors should consider more experimental cases, the paper is weak in theory, the proof is weak, the contribution is not enough, or the authors failed to consider some related research. These are extremely tough reviewers, but luckily such people very seldom review papers themselves; rather, they assign the task to their graduate students.

Sincere but nescient student - Here I mean a graduate student who knows the field at large but is unfamiliar with the specific topic which the paper addresses. Such reviewers work extremely hard and consult many sources, such as Google search and the references mentioned in the paper, to get an overview of what the paper means. These reviewers generally comment on the overall quality of the writing, experiments and explanation in the paper. They are unable to point out any lack of novelty in the paper, or whether the technique is some old known method in a new cloak. They generally give a neutral review or, if your paper is well written, a favorable one. This is the most common type of reviewer.

Expert student - Such people rarely review a paper, because if a student is an expert in some topic then he must be doing research in that area and might have his own papers in that track (or someone else in his research team might), and is hence not eligible to review papers from that track. These reviewers are extremely harsh, because they can understand and see through everything mentioned in the paper. If you escape with minor criticism, you should consider yourself lucky and the quality of your research extremely good, because to such reviewers most contributions seem minor.

Lazy and dumb student - These qualities go side by side; these students are generally dumb because of their laziness. Such reviewers look at the general structure of the paper, and will give it a favorable review with maybe some minor criticism. They hardly bother to understand the paper, and only require that it be coherent and easy to read.

Inexperienced professors - These are generally new PhD graduates who have just been appointed as professors at some college. These reviewers have very strong expertise in their research areas, but again they don't review papers in those areas, as they will generally also have some submissions in the same track. But unlike experienced professors, these people do have to review papers themselves, as they do not have many graduate students working under them; besides, they also have the enthusiasm of a newcomer.
These reviewers are generally lenient, because they want other reviewers to be lenient on their work too, and hope that somehow their charity will help them in return. Overall, their reviews are generally of high quality, because even if they don't know much about the specific topic, they have the experience of reviewing several papers in their student years, and besides, they are hard working, which helps them quickly learn and get acquainted with the field.

If you want to add something to this categorization or argue against it, feel free to post a comment.

Next, review of selected ICML 2006 papers.

Wednesday, May 03, 2006

Securing Rails application

Using Ruby on Rails has been fun. An important part of developing a web application is making sure that most of the security loopholes are plugged. Here are a few basic things which should be checked and fixed in any security audit:

Users trying to execute JavaScript or insert HTML, i.e. Cross-Site Scripting (CSS/XSS).
http://manuals.rubyonrails.com/read/chapter/44 provides a tutorial on this, but the basic idea is to use the HTML-escaping function `h' or `sanitize' when displaying any data input by the user. Even XML/RSS data which is displayed should be secured. Sanitize is generally safe, but there may still be some hidden security flaws.
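As a small illustration of what escaping does, here is a sketch using `ERB::Util.html_escape` from Ruby's standard library, which is the function behind the `h' helper. The method name `render_comment` and the sample input are made up for this example.

```ruby
require "erb"

# Escape user-supplied text before placing it inside HTML.
# `render_comment` is a hypothetical helper for illustration.
def render_comment(text)
  "<p>#{ERB::Util.html_escape(text)}</p>"
end

puts render_comment('<script>alert("xss")</script>')
# prints: <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```

The browser then shows the script as text instead of running it.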

SQL Injection, or a user maliciously inserting SQL statements into your queries.
http://manuals.rubyonrails.com/read/chapter/43
again has a tutorial chapter on this. The key point is to avoid writing SQL directly. In the cases where you need to, instead of embedding user input in the SQL with
" ..subject = #{params[:subject]} ..."
use
" ..subject = ? ...", params[:subject]
Also, turn off the echo service on the production web server.
Rails makes it so easy to avoid the common SQL injection attacks. I remember working in a corporate environment with PHP, where SQL injection was possible in almost all the SQL statements in the code.
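To see why the placeholder form matters, the sketch below builds both variants as plain strings, without touching a database. The helper names, the `emails` table and the sample input are hypothetical.

```ruby
# Hypothetical helpers for illustration; no database is involved.

# Unsafe: user input is pasted into the SQL text itself.
def unsafe_query(subject)
  "SELECT * FROM emails WHERE subject = '#{subject}'"
end

# Safer: the SQL text stays constant and the value travels separately,
# in the [sql, value] shape that placeholder-style conditions use.
def safe_conditions(subject)
  ["subject = ?", subject]
end

malicious = "x' OR '1'='1"
puts unsafe_query(malicious)
# prints: SELECT * FROM emails WHERE subject = 'x' OR '1'='1'
p safe_conditions(malicious)
```

In the unsafe version the attacker's quote characters change the structure of the query; in the placeholder version the library quotes the value before it reaches the database.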

Creating records directly from form parameters.
Another chapter, http://manuals.rubyonrails.com/read/chapter/47, gives the tutorial details. The key point is that to prevent some model fields from being updated directly (or en masse) from the form parameters, use `attr_protected' or the more secure `attr_accessible'.
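The whitelist idea behind `attr_accessible' can be shown in plain Ruby. This is only a sketch of the principle, not how Rails implements it; `ACCESSIBLE_FIELDS`, `permitted_attributes` and the sample parameters are hypothetical.

```ruby
# Hypothetical whitelist; inside a model the matching declaration
# would be: attr_accessible :name, :email
ACCESSIBLE_FIELDS = %w[name email].freeze

# Keep only the form parameters that the whitelist names.
def permitted_attributes(params, allowed = ACCESSIBLE_FIELDS)
  params.select { |key, _value| allowed.include?(key.to_s) }
end

form_params = { "name" => "Eve", "email" => "eve@example.com", "admin" => "1" }
p permitted_attributes(form_params)   # the "admin" key is dropped
```

A whitelist is the safer default because a newly added column stays protected until it is explicitly listed.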

Not exposing controller methods.
Mark controller methods which should not be accessible to the user as `private' or `protected'.
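A plain Ruby class is enough to show the effect. `EmailsController` here is a hypothetical stand-in that does not inherit from a Rails controller; the point is only that methods below `private` are not part of the public interface, which is what Rails exposes as actions.

```ruby
# Hypothetical stand-in for a controller; runs without Rails.
class EmailsController
  # Public method: reachable as an action.
  def show
    "rendering #{find_email}"
  end

  private

  # Helper that should never be callable from a URL.
  def find_email
    "email"
  end
end

controller = EmailsController.new
p EmailsController.public_instance_methods(false)  # only the public action
p controller.respond_to?(:find_email)               # false
```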

Checking file/attachment uploads.
It is generally safer to store the files in the database. Check the size of the file being uploaded, and make sure that it doesn't exceed the permissible limit. Also check the file extension, and make sure that it is a valid extension and not `*.cgi', `*.php', `*.js' etc.
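A minimal sketch of these two checks follows. Note that it uses a whitelist of allowed extensions rather than a blacklist of dangerous ones, which is a swapped-in variation on the advice above. The limit, the extension list and the name `upload_allowed?` are assumptions for illustration.

```ruby
# Hypothetical limits; adjust to the application.
MAX_UPLOAD_BYTES   = 2 * 1024 * 1024
ALLOWED_EXTENSIONS = %w[.jpg .jpeg .png .gif .pdf].freeze

# Accept the upload only if its extension is whitelisted and its size is within the limit.
def upload_allowed?(filename, size_in_bytes)
  extension = File.extname(File.basename(filename)).downcase
  ALLOWED_EXTENSIONS.include?(extension) && size_in_bytes <= MAX_UPLOAD_BYTES
end

p upload_allowed?("holiday.JPG", 500_000)
p upload_allowed?("shell.php", 1_000)
```

The check only looks at the final extension, so a name like `evil.php.jpg` passes; the file content still deserves its own validation.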

Be careful about ID parameters.
For certain operations, for example displaying email, a user should only be able to read his or her own email. In this case, the controller that displays the email should verify that the user is authorized to read the email whose ID was passed in.
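The check can be as simple as comparing the owner of the record with the logged-in user before returning it. The sketch below uses an in-memory list instead of a database; `Email`, `NotAuthorized` and `find_email_for` are hypothetical names.

```ruby
# Hypothetical in-memory records standing in for a database table.
Email = Struct.new(:id, :owner_id, :body)

class NotAuthorized < StandardError; end

# Look up an email by ID, but only hand it back to its owner.
def find_email_for(user_id, email_id, emails)
  email = emails.find { |e| e.id == email_id }
  if email.nil? || email.owner_id != user_id
    raise NotAuthorized, "user #{user_id} may not read email #{email_id}"
  end
  email
end

emails = [Email.new(1, 10, "hello"), Email.new(2, 11, "private")]
puts find_email_for(10, 1, emails).body
```

Raising the same error for a missing record and for someone else's record avoids revealing which IDs exist.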