6

I'm looking for articles on ways to filter spam. When I search around, all I keep finding is WordPress, ways to filter swear words, etc., which is not what I'm looking for. I'm looking for ways to write your own filter system, and best practices for doing so.

Any tutorial links from anyone who has done this before would be appreciated.

The only good article I've found so far is http://snook.ca/archives/other/effective_blog_comment_spam_blocker

Sean H Jenkins

7 Answers

12

When writing your own method, you'll have to employ a combination of heuristics.

For example, it's very common for spam comments to have 2 or more URL links.
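As a rough sketch of that link-count heuristic: the function names `countLinks` and `hasTooManyLinks`, the regular expression, and the default limit of one link are all assumptions made for illustration, not part of any library.

```php
<?php
// Hypothetical helper: counts URL-like prefixes in a comment.
function countLinks(string $text): int
{
    // Counts "http://" and "https://" prefixes, plus "www." when it is not
    // already part of a longer URL or word (so "http://www.x" counts once).
    $count = preg_match_all('~https?://|(?<![/.\w])www\.~i', $text);

    // preg_match_all returns false on error; treat that as "no links".
    return $count === false ? 0 : $count;
}

// Assumed rule: more than $maxLinks links (default 1) looks suspicious.
function hasTooManyLinks(string $text, int $maxLinks = 1): bool
{
    return countLinks($text) > $maxLinks;
}
```

A result like this probably works best as one signal among several rather than as a hard block, since legitimate comments sometimes contain links too.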

I'd begin writing your filter like so, using a dictionary of trigger words and have it loop through and use those to determine probability:

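function spamProbability($text){
    $probability = 0;
    $text = strtolower($text); // lowercase once, outside the loop, so matching is case-insensitive
    $myDict = array("http","penis","pills","sale","cheapest"); // trigger words
    foreach($myDict as $word){
        $count = substr_count($text, $word); // occurrences of this trigger word
        $probability += .2 * $count; // each occurrence adds a fixed weight
    }
    return $probability; // a score rather than a true probability: it can exceed 1
}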

Note that this method will produce many false positives, depending on your word set. You could have your site flag comments with a probability > .3 and < .6 for moderation (while they still go live immediately), send those > .6 and < .9 to a moderation queue (where they don't appear until approved), and simply reject anything over 1.
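The tiers described above could be mapped to an action roughly like this. The function name and the action labels are hypothetical, and the band edges are closed up here (so a score of exactly .6, or one between .9 and 1, is queued) purely as an assumption so that every score gets exactly one action.

```php
<?php
// Hypothetical mapping from a spam score to a moderation action.
function moderationAction(float $score): string
{
    if ($score > 1.0) {
        return 'reject';   // never stored or shown
    }
    if ($score >= 0.6) {
        return 'queue';    // hidden until a moderator approves it
    }
    if ($score > 0.3) {
        return 'flag';     // goes live, but is marked for review
    }
    return 'publish';      // goes live with no review
}
```

For example, `moderationAction(0.4)` returns `'flag'` and `moderationAction(1.2)` returns `'reject'`.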

Obviously you'll have to tweak these thresholds, but this should start you off with a pretty basic system. You can add several other qualifiers that increase or decrease the probability of spam, such as checking the ratio of bad words to total words, giving different words different weights, etc.
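One possible way to combine per-word weights with the ratio of trigger words to total words is sketched below. The weights, the `SPAM_WEIGHTS` constant, and the function name are made up for illustration and would need tuning against real comments.

```php
<?php
// Hypothetical per-word weights; tune them against your own spam samples.
const SPAM_WEIGHTS = [
    'http'     => 0.2,
    'pills'    => 0.4,
    'sale'     => 0.1,
    'cheapest' => 0.3,
];

function weightedSpamScore(string $text, array $weights = SPAM_WEIGHTS): float
{
    $text  = strtolower($text); // case-insensitive matching
    $score = 0.0;
    $hits  = 0;

    foreach ($weights as $word => $weight) {
        $count  = substr_count($text, $word);
        $score += $weight * $count; // heavier words add more
        $hits  += $count;
    }

    // Add the share of the comment made up of trigger words, capped at 1
    // because substring matches can outnumber whole words.
    $totalWords = str_word_count($text);
    if ($totalWords > 0) {
        $score += min(1.0, $hits / $totalWords);
    }

    return $score;
}
```

With this approach a short comment consisting almost entirely of trigger words scores higher than a long comment that happens to mention one of them once.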

Tim
  • 1
    Wait a moment... you can't call strtolower($text) again for each word. – dynamic Dec 07 '11 at 17:31
  • 1
    Why not? This way it catches variations on case. We're not outputting the new lowercase string, we're just comparing it and getting rid of it. – Tim Dec 07 '11 at 17:34
  • Unless you mean the efficiency is bad, which yes, it is, this is just a concept example. – Tim Dec 07 '11 at 17:34
  • Updated to only run one lowercase. – Tim Dec 07 '11 at 17:35
  • I figure his site is probably pretty low traffic and comments aren't more than a few hundred characters long so once he learns the concepts he can work on creating a faster method. There'd be no point answering this question and presenting the OP with one, four hundred character long regular expression, am I right? ;-) – Tim Dec 07 '11 at 17:38
  • Thanks for the replies all. Much appreciated – Sean H Jenkins Dec 07 '11 at 17:58
2

I'm surprised no one mentioned Akismet. I've never had a message misclassified (be it spam or legit). My WordPress install came with it. All I had to do was hit enable.

Brigand
1

You could have a look at the b8 spam filter: http://nasauber.de/opensource/b8/

Tobias Leupold
1

Are you looking for a way to stop spam from bots and such? If so, you can always add a CAPTCHA: http://en.wikipedia.org/wiki/CAPTCHA. It should be easy enough to add to any project, if this is what you are trying to do. Otherwise, I am not sure what you mean by filtering spam.

Hudspeth
  • Well a captcha is one way but it won't stop people who manually write spam comments. Really, I'm looking for a function or functions that could take a comment and output a spam probability. – Sean H Jenkins Dec 07 '11 at 17:15
  • That doesn't seem like something you could do easily. Human-created spam is posted just like any other comment. If they are taking the time to spam you by hand, then they will find ways to spam you anyway. The only defense against something like this might be some IP blocking, if you notice the spam comes from certain IP addresses. – Hudspeth Dec 07 '11 at 17:33
  • In my question I linked to one way this can be achieved, but I was looking for different systems / methods. IP blocking is not effective, as so-called 'hardcore' comment spammers will use proxies to bounce requests, so blocking IP addresses will, in the long run, lose you visitors. – Sean H Jenkins Dec 07 '11 at 17:56
1

Here is another good tutorial about dealing with spammers and their spam:

How To Stop Manual Comment Spammers

Here is a link to a good similar SO question:

non-captcha methods for blocking spam on my comments

Hope this helps.

Community
AlphaMale
0

I guess the article "The war with spam comment" can give you some hints. Of course, some bots are smart enough nowadays that you may need to add a CAPTCHA as well.

PixelsTech
0

Consider implementing reCAPTCHA. Here are two links: http://www.google.com/recaptcha and http://code.google.com/apis/recaptcha/docs/php.html