Cheap Shared Hosting

Tuesday, April 1, 2008

Anti Comment Spam System in ColdFusion

This is a system I've been working on for probably a couple of years now in total. It runs on a fairly busy site, and is meant to prevent comment spam. The comment system itself is custom-coded in ColdFusion, built to order, and stores its comments in flat files.

I'm not a huge fan of ColdFusion - it seems like everything I try to do with it is about twice as difficult as it should be - perhaps I'm just too used to PHP and Perl. Anyways, I'm going to give a general overview of the system. I can't post code, because the code belongs to the client. But perhaps the general idea of the system will be enough to be helpful to someone out there.

So, when someone leaves a comment, it's immediately visible and an email is sent to the admin, so he knows and can take action if the comment is spam. That worked fine for far longer than you'd expect, but eventually the spammers found it and started bombarding him with crap.

So, the first line of defense we added was quite simple - it blocked anyone trying to submit URLs using common forum codes, e.g. "[url]some spammy site[/url]". This was an easy fix: since the site is custom code and nothing on it tells people to use such a code to add a URL, there wasn't much chance of getting false positives.
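Since I can't show the client's ColdFusion, here is a minimal sketch of the same idea in Python. The function name and the dictionary of form fields are my own assumptions for illustration, not the real code.

```python
import re

# Matches an opening [url] or [url=...] forum tag, in any letter case.
BBCODE_URL = re.compile(r"\[url(=[^\]]*)?\]", re.IGNORECASE)


def has_forum_url_code(fields):
    """Return True if any submitted field contains a [url] forum tag.

    `fields` is a hypothetical dict of form field name -> submitted text.
    """
    return any(BBCODE_URL.search(value) for value in fields.values())
```

A call like has_forum_url_code({"comment": "[url]some spammy site[/url]"}) would flag the comment, while ordinary text that merely contains the word "url" would pass.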

That helped quite a bit, but eventually we needed more. Some spammers will just post straight links, and some just post gibberish for no apparent reason - random strings of pill names and other crap. So we added the second line of defense, a customizable list of terms that the admin could use to block any comment that had such a term in any of its fields.
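The term list check could look something like the sketch below. It assumes a case-insensitive substring match, which is my guess at reasonable behavior rather than a description of the real system, and the names are made up for the example.

```python
def contains_blocked_term(fields, blocked_terms):
    """Return True if any blocked term appears in any submitted field.

    `fields` is a hypothetical dict of form field name -> submitted text.
    `blocked_terms` is the admin's list; blank entries are ignored so an
    empty line in the list cannot block every comment.
    """
    lowered_values = [value.lower() for value in fields.values()]
    terms = [term.strip().lower() for term in blocked_terms if term.strip()]
    return any(term in value for term in terms for value in lowered_values)
```

A substring match is blunt, so a short term can hit innocent words. That is the trade-off the admin accepts by adding it to the list.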

Now when a comment is blocked by the system, it is basically ignored - nothing is logged, it never shows up on the site, no email is sent to the admin.

Eventually, the client needed a third level of protection - spammers were still posting links. As a temporary fix, he blocked "http", but spammers would still post bare domain names, etc. Plus, actual legit users might want to post links as they discuss the articles on the site, so he knew he needed a better system.

We talked about using a CAPTCHA (those visual puzzles where you have to type the letters in the picture) but decided against it, both for accessibility reasons (lots of people have problems with those, plus how is a blind user to use it?) and because I'm not sure they provide much protection any more - after all, the Gmail CAPTCHA has been broken for quite a while!

So I decided to implement one of those simple question and answer systems. Basically, the user must answer a simple arithmetic problem before posting. The assumption is that bots would not be smart enough to answer such a problem, at least not without custom programming by the spammers - and they'd rather just move on to an easier target - of which there are many!

To implement it, the system picks two random one-digit positive numbers, and asks the user to enter the sum. The tricky part is that the answer must be encoded in the form itself, so that the system knows what the correct answer is. So, the actual correct answer is also stored in a hidden field in the form. But obviously this makes it easy for a bot to "see" and use, so it must be obfuscated somehow. To do this, I hash the correct answer, using a key which changes daily.

A "hash" basically "makes a hash" of a string - it turns the input into a fixed-length bunch of gibberish that can't practically be turned back into the original, and mixing in a key means the same answer comes out as different gibberish under a different key. The system builds the key from a number of factors, including things like the date, the IP of the user, the article URL, etc. The idea is you need a key that is fairly hard to guess, but that the form processing script can also come up with on its own, without the key being passed to it somehow.

So, the correct answer is hashed and put into a hidden field. When the user submits the form, the comment script takes the user's answer and hashes it with that same key, then compares the result to the hashed correct answer submitted in the hidden field. If the answer is correct, then the comment is posted (assuming it passed the older tests as well) - if not, the user is told to go back and try again.
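To make the flow concrete, here is a rough Python sketch of the hashed-answer idea. It is not the client's code: using HMAC-SHA256 as the keyed hash, the SERVER_SECRET value, and all of the function names are assumptions I've picked for illustration. The server-side secret in particular is my addition, since a key built only from the date, IP and URL could be rebuilt by anyone who learned the recipe.

```python
import hashlib
import hmac
import random
from datetime import date

# Hypothetical secret that never leaves the server (an assumption).
SERVER_SECRET = b"replace-with-a-long-random-value"


def daily_key(ip, article_url, day):
    """Derive a key from the date, the user's IP and the article URL."""
    material = f"{day.isoformat()}|{ip}|{article_url}".encode("utf-8")
    return hmac.new(SERVER_SECRET, material, hashlib.sha256).digest()


def hash_answer(answer, key):
    """Keyed hash of an answer, as hex text suitable for a hidden field."""
    text = str(answer).strip().encode("utf-8")
    return hmac.new(key, text, hashlib.sha256).hexdigest()


def make_challenge(ip, article_url, day=None):
    """Pick two one-digit numbers; return them plus the hashed sum."""
    day = day or date.today()
    a, b = random.randint(1, 9), random.randint(1, 9)
    token = hash_answer(a + b, daily_key(ip, article_url, day))
    return a, b, token


def check_answer(user_answer, token, ip, article_url, day=None):
    """Re-derive the key, hash the user's answer, compare to the token."""
    day = day or date.today()
    expected = hash_answer(user_answer, daily_key(ip, article_url, day))
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

The form would show "What is a + b?" and carry the token in a hidden field, and the processing script would call check_answer with the submitted values. One thing to keep in mind with a daily key: a form loaded just before midnight and submitted just after will fail, so the user simply has to try again. Also, there are only 17 possible sums, so the whole scheme depends on the key being hard to guess.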

So far this system is working. I still have some tricks up my sleeve for when it needs to be updated again - one that looks promising is taking advantage of the fact that bots tend to fill out all the fields in a form - even fields that a real user can't see. For example, you could create some dummy fields and hide them via CSS. A real user wouldn't see them, while a bot reading the source code of the page would. So if you see text in one of those dummy fields, you know it's a bot - and block it.
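I haven't built this yet, so the following is only a sketch of how the check on the server side might go, in Python. The dummy field names are hypothetical, and the form is assumed to arrive as a plain dictionary.

```python
# Hypothetical names of the dummy fields hidden from real users via CSS.
HONEYPOT_FIELDS = ("website2", "email_confirm")


def looks_like_bot(form):
    """Return True if any hidden dummy field came back with text in it.

    `form` is a hypothetical dict of submitted field name -> text. A real
    user never sees the dummy fields, so they should be missing or empty.
    """
    return any(form.get(name, "").strip() for name in HONEYPOT_FIELDS)
```

Giving the dummy fields tempting names is part of the trick, since a bot is more likely to fill in something that looks like a normal field.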

I also haven't done anything with IP addresses - so adding a simple IP blocking system may be worth pursuing at some point.
