Eliminate Form Spam Using Behavioural Analysis and Individual Forms
Do you really need CAPTCHAs?
Spam bots are so effective today mostly because only a handful of different publishing packages (forum, blog, CMS) dominate the web. This is very convenient for the spammers, because all the information required to write and tweak a bot is identical across installations. All the bot has to do is query Google for parts of a characteristic URL (e.g. [url=http://www.google.com/search?q=inurl%3Ayabb.pl]inurl:yabb.pl[/url] or [url=http://www.google.com/search?q=inurl%3Aphpbb%2Fposting.php]inurl:phpbb/posting.php[/url]) and then apply an algorithm to circumvent that software’s specific protection (if any).
If you use less-known publishing software, or write your own, it is unlikely that you’ll be overrun by bots. A popular software development blog, for example, [url=http://www.codinghorror.com/blog/]CodingHorror[/url], shows the exact same (easily readable and audible) CAPTCHA on every page and has no spam problems whatsoever, because it runs on custom software. For years, I’ve been running a customized installation of PostNuke/phpBB – no CAPTCHA, no spam, no false registrations! Given this, we can vastly improve the situation by using generated or customized URLs. This should be built into future versions of publishing software. (Besides, most other attacks also rely on simple name- or path-based analysis to find vulnerable servers.)
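The idea of generated/customized URLs can be sketched as follows. This is a hypothetical illustration, not code from any real publishing package: a per-installation secret is used to derive unguessable form paths and field names, so a Google query for a well-known URL fragment like inurl:posting.php no longer finds the installation. The naming scheme and `SECRET_KEY` are assumptions.

```python
import hashlib
import hmac
import secrets

# Per-installation secret; in practice this would be generated once at
# install time and stored in the site's configuration (assumption).
SECRET_KEY = secrets.token_bytes(32)

def customized_name(base: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable but installation-specific name for a URL or form field."""
    digest = hmac.new(key, base.encode(), hashlib.sha256).hexdigest()[:12]
    return f"{base}-{digest}"

# Instead of the well-known /posting.php, every installation gets its
# own path and field names, identical for its users but different from
# every other installation's.
form_path = "/" + customized_name("posting") + ".php"
email_field = customized_name("email")
```

Because the name is derived with HMAC rather than stored per form, it stays stable across requests without any database lookup, yet cannot be predicted without the key.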
What is generally missing from all the software solutions I know is behavioural analysis: looking at the differences between human and robot activity and ranking form input much like an email spam filter does. This must go beyond the short summary given by [url=http://www.ibm.com/developerworks/web/library/wa-realweb10/index.html?ca=drs-]IBM DeveloperWorks[/url]. While it sums up the most important issues, it omits a few others, such as form fields with popular names (name, firstname, email) hidden by CSS to trick bots into filling them in, IP-based filtering by country to flag likely spam, and collaborative services like [url=http://en.wikipedia.org/wiki/DNSBL]DNSBLs[/url] and [url=http://akismet.com/]Akismet[/url]. What we need is an individually trainable algorithm that also takes web server access logs into account.
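A minimal sketch of such scoring, combining three of the signals above (a CSS-hidden honeypot field, time-to-submit, and a DNSBL hit) in the spirit of an email spam filter. The field name `email_trap`, the weights, and the threshold are illustrative assumptions, not a tuned filter:

```python
def score_submission(fields: dict, rendered_at: float, submitted_at: float,
                     dnsbl_listed: bool = False) -> int:
    """Add up suspicion points for one form submission (illustrative weights)."""
    score = 0
    # Honeypot: a field with a popular name, hidden by CSS, that humans
    # never see -- bots tend to fill in every field they find.
    if fields.get("email_trap"):
        score += 10
    # Timing: humans rarely fill out a form in under two seconds.
    if submitted_at - rendered_at < 2.0:
        score += 5
    # Reputation: the client IP is listed on a DNS blocklist (looked up
    # elsewhere, e.g. via a DNSBL query).
    if dnsbl_listed:
        score += 5
    return score

def is_spam(score: int, threshold: int = 10) -> bool:
    """Reject (or queue for a CAPTCHA) once the score reaches the threshold."""
    return score >= threshold
```

The point of a score, rather than a single hard rule, is that no one signal is decisive: a fast submission alone is merely suspicious, but a fast submission from a blocklisted IP crosses the threshold. The weights are exactly what an "individually trainable" algorithm would learn per site.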
In my opinion, a CAPTCHA should be the last resort, used only when these other mechanisms fail.
Quite ironic that this blog uses reCAPTCHA, isn’t it?