Filter Comment Spam? PHP

Question

I'm looking for articles on ways to filter spam. When I search around, all I keep finding is WordPress, ways to filter swear words, etc., which is not what I'm looking for. I'm looking for ways to write your own filter system, and for best practices.

Any tutorial links from anyone who has done this before would be appreciated.

The only good article I have found so far is http://snook.ca/archives/other/effective_blog_comment_spam_blocker

score 12 · Accepted Answer · edited Aug 07 '13 at 17:18


When writing your own method, you'll have to employ a combination of heuristics.

For example, it's very common for spam comments to contain two or more links.
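A minimal sketch of that link heuristic, assuming a link is anything that starts with http://, https:// or www. The function names `countLinks` and `hasTooManyLinks` and the default threshold of 2 are illustrative choices, not part of any library:

```php
<?php
// Hypothetical helper: counts URL-like substrings in a comment.
function countLinks(string $text): int
{
    // Match http://, https:// or www. followed by non-space characters.
    $count = preg_match_all('~(?:https?://|www\.)\S+~i', $text);
    return $count === false ? 0 : $count;
}

// Hypothetical helper: true when the comment reaches the link threshold.
function hasTooManyLinks(string $text, int $threshold = 2): bool
{
    return countLinks($text) >= $threshold;
}
```

It is probably safer to feed the link count into an overall score than to reject on it alone, since legitimate comments sometimes contain several links too.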

I'd begin by writing your filter like so: loop through a dictionary of trigger words and use the number of matches to determine a probability:

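function spamProbability($text){
    $probability = 0;
    // Lowercase once, outside the loop, so matching is case-insensitive
    $text = strtolower($text);
    // Trigger words; each occurrence adds .2 to the score
    $myDict = array("http","penis","pills","sale","cheapest");
    foreach($myDict as $word){
        // substr_count matches substrings, so "sale" also hits "wholesale"
        $count = substr_count($text, $word);
        $probability += .2 * $count;
    }
    // Not a true probability: the total can exceed 1
    return $probability;
}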

Note that this method will result in many false positives, depending on your word set. You could have your site flag comments with a probability > .3 and < .6 for moderation (they still go live immediately), send those > .6 and < .9 to a moderation queue (where they don't appear until approved), and simply reject anything over 1.
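Those bands could be sketched as a small function that maps a score to an action. The name `moderationAction` and the returned labels are made up for illustration. The bands above leave exactly .6 and the range from .9 to 1 unspecified, so this sketch assumes contiguous bands and treats everything above .6 up to and including 1 as queued:

```php
<?php
// Hypothetical mapping from a spam score to a moderation action.
// Assumption: bands are contiguous, so a score of .9 to 1 is queued.
function moderationAction(float $score): string
{
    if ($score > 1.0) {
        return 'reject';   // never stored or shown
    }
    if ($score > 0.6) {
        return 'queue';    // hidden until a moderator approves it
    }
    if ($score > 0.3) {
        return 'flag';     // goes live, but is marked for review
    }
    return 'publish';      // goes live with no further action
}
```

For example, `moderationAction(spamProbability($comment))` would combine it with the function above.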

Obviously you'll have to tweak these thresholds, but this should start you off with a pretty basic system. You can add several other qualifiers that increase or decrease the probability of spam, such as checking the ratio of bad words to total words, changing the weights of individual words, etc.
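As a rough sketch of two of those qualifiers, the version below gives each trigger word its own weight and also computes the ratio of trigger words to total words. The weights, the constant `SPAM_WEIGHTS` and both function names are assumptions chosen for illustration; you would tune the values against your own comments:

```php
<?php
// Hypothetical per-word weights; tune these for your own site.
const SPAM_WEIGHTS = ['http' => 0.3, 'pills' => 0.5, 'sale' => 0.2, 'cheapest' => 0.4];

// Sums weight * occurrences for every trigger word (substring match).
function weightedSpamScore(string $text, array $weights = SPAM_WEIGHTS): float
{
    $text = strtolower($text);
    $score = 0.0;
    foreach ($weights as $word => $weight) {
        $score += $weight * substr_count($text, $word);
    }
    return $score;
}

// Fraction of whole words in the comment that are trigger words.
function triggerWordRatio(string $text, array $weights = SPAM_WEIGHTS): float
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0.0; // avoid dividing by zero on empty comments
    }
    $hits = 0;
    foreach ($words as $word) {
        if (isset($weights[$word])) {
            $hits++;
        }
    }
    return $hits / count($words);
}
```

A high ratio on a short comment is arguably a stronger signal than the same raw count spread across a long one, which is why the two numbers are kept separate here.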


answered Dec 07 '11 at 17:29

Tim



Wait a moment... you shouldn't call strtolower($text) again for each word. – dynamic Dec 07 '11 at 17:31

Why not? This way it catches variations on case. We're not outputting the new lowercase string, we're just comparing it and getting rid of it. – Tim Dec 07 '11 at 17:34
Unless you mean the efficiency is bad, which yes, it is, this is just a concept example. – Tim Dec 07 '11 at 17:34
Updated to only run one lowercase. – Tim Dec 07 '11 at 17:35
I figure his site is probably pretty low traffic and comments aren't more than a few hundred characters long, so once he learns the concepts he can work on creating a faster method. There'd be no point answering this question by presenting the OP with a single four-hundred-character regular expression, am I right? ;-) – Tim Dec 07 '11 at 17:38
Thanks for the replies all. Much appreciated – Sean H Jenkins Dec 07 '11 at 17:58

score 2 · Answer 2 · answered Dec 07 '11 at 18:09


I'm surprised no one mentioned Akismet. I've never had it misclassify a message (be it spam or legitimate). My WordPress install came with it. All I had to do was hit enable.


Brigand


score 1 · Answer 3 · answered Nov 24 '12 at 21:28


You could have a look at the b8 spam filter: http://nasauber.de/opensource/b8/


Tobias Leupold


score 1 · Answer 4 · answered Dec 07 '11 at 17:13


Are you looking for a way to stop spam from bots and such? If so, you can always add a CAPTCHA (http://en.wikipedia.org/wiki/CAPTCHA). It should be easy enough to put on any project, if this is what you are trying to do. Otherwise, I am not sure what you mean by filtering spam.


Hudspeth


Well, a CAPTCHA is one way, but it won't stop people who manually write spam comments. Really, I'm looking for a function or functions that could take a comment and output a spam probability. – Sean H Jenkins Dec 07 '11 at 17:15
Doesn't seem like something that you could easily do. Predicting human-created spam is hard, since it is posted like any other comment. If they are taking the time to spam you by hand, then they will find ways to spam you anyway. The only block against something like this might be some IP blocking, if you notice the spam comes from certain IP addresses. – Hudspeth Dec 07 '11 at 17:33
In my question I posted one way this can be achieved, but I was looking for different systems / methods. IP blocking is not effective, as so-called 'hardcore' comment spammers will use proxies to bounce requests; blocking IP addresses will therefore, in the long run, lose you visitors. – Sean H Jenkins Dec 07 '11 at 17:56

score 1 · Answer 5 · edited May 23 '17 at 12:09

Here is another good tutorial about dealing with spammers and their spam:

How To Stop Manual Comment Spammers

Here is a link to a good similar SO question:

non-captcha methods for blocking spam on my comments

Hope this helps.


answered Dec 07 '11 at 17:18

AlphaMale


score 0 · Answer 6 · edited Feb 22 '14 at 01:40

I guess the article "The war with spam comment" can give you some hints. Of course, some bots are quite smart nowadays, so you may need to add a CAPTCHA as well.

edited Feb 22 '14 at 01:40

answered Feb 22 '14 at 01:31

PixelsTech


score 0 · Answer 7 · answered Dec 07 '11 at 17:29


Consider implementing reCAPTCHA: see http://www.google.com/recaptcha and the PHP documentation at http://code.google.com/apis/recaptcha/docs/php.html
