I am subscribed to a mailing list where some of the messages are in languages other than English, which I cannot understand.

How do I filter the non-English messages to /dev/null using procmail and/or command line tools?

I use procmail to filter my email, so ideally any alternative tool would also be usable from a procmail recipe.

I'd prefer not to have to train my own language models.

makeyourownmaker

2 Answers

One way is to use the Perl TextCat package from Gertjan van Noord.

The text_cat script reads the mail on standard input and prints the most likely language. The recipe below assumes text_cat has been installed under /usr/local/bin.

Here is a simple procmail recipe to call the text_cat script:

:0
* ^Subject.*Jobs.*Board
{
    LANG_=`/usr/local/bin/text_cat`

    :0
    * ! LANG_ ?? ^english$
    /dev/null

    :0
    jobs/
}

I've been running text_cat for a few years. In that time no non-English message has been classified as English, that is, there have been no false positives. I've not been rigorous about checking for false negatives (English messages classified as something else).


A second way, as mentioned by tripleee in a comment, is to use the language categorisation provided by SpamAssassin, which also uses the text_cat script. SpamAssassin decodes any MIME transfer encodings before classifying the text, which the plain text_cat recipe above does not.

Here is a not fully tested procmail recipe for filtering on the SpamAssassin X-Spam-Languages header:

:0
* ^Subject.*Jobs.*Board
{    
    # Divert non-English emails to a folder using the SpamAssassin header
    # Matches when the header is anything other than exactly "X-Spam-Languages: en"
    :0
    * !^X-Spam-Languages: en$
    foreign/

    # Save english language mails in folder
    :0
    jobs/
}

Warning: spamassassin will occasionally provide multiple language categorisations like so:

X-Spam-Languages: en da ro

which the above recipe does not account for.
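
One possible way to cope with multi-language headers is to look for en as a whole word anywhere in the list, rather than as the only value. The following is an untested sketch that assumes the header always has the space-separated form shown above. It would replace the first inner recipe:

```procmail
# Untested sketch: treat the mail as English if "en" is one of the
# space-separated codes in X-Spam-Languages, e.g. "en da ro" or "da en".
:0
* ! ^X-Spam-Languages:(.* )?en( |$)
foreign/
```

A message with no X-Spam-Languages header at all also fails the match and ends up in foreign/, just as it does with the stricter recipe.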

Spamassassin Language Categorisation Configuration

Edit /etc/spamassassin/v310.pre and uncomment the following line:

loadplugin Mail::SpamAssassin::Plugin::TextCat

Configure the plugin in /etc/spamassassin/local.cf:

ok_languages en       # I understand english
inactive_languages '' # Enable all languages
add_header all Languages _LANGUAGES_
# score UNWANTED_LANGUAGE_BODY 5 # Increase score - not necessary and not recommended 

This recipe was incompletely tested with spamassassin version 3.4.2.
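
The header test assumes SpamAssassin has already processed the message before the language recipe runs. A commonly used filter recipe for that looks roughly like the following. The size limit and lock file name are illustrative assumptions, and spamassassin is assumed to be on procmail's PATH (sites running spamd would call spamc instead):

```procmail
# Pass the message through SpamAssassin so that the X-Spam-* headers,
# including X-Spam-Languages, are added before later recipes run.
# f = treat the program as a filter, w = wait for it and check its exit code.
:0fw: spamassassin.lock
* < 256000
| spamassassin
```

Place it above the recipe that tests X-Spam-Languages.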


To adapt these recipes to keep a different language instead, substitute that language's name for english in the first recipe and its two-letter language code for en in the second.
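
For instance, to keep only German mail the conditions might become the following. This is an untested sketch: german is assumed to be the name text_cat prints for German, de the code SpamAssassin reports, and ok_languages in local.cf would need changing to match. Use one variant or the other, not both.

```procmail
# Variant 1 (text_cat): discard anything not classified as German
LANG_=`/usr/local/bin/text_cat`

:0
* ! LANG_ ?? ^german$
/dev/null

# Variant 2 (SpamAssassin header): divert anything not tagged exactly "de"
:0
* !^X-Spam-Languages: de$
foreign/
```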

makeyourownmaker
  • Keep in mind that the models which TextCat ships with are pretty crude, and don't do a good job of distinguishing between e.g. Danish and Norwegian or Serbian and Russian. There are other language identification tools with a similar interface, or you could train your own TextCat models. – tripleee Jun 23 '20 at 16:39
  • Good to know, but I'm interested in english versus everything else. I'd prefer not to have to train my own models (I've updated my question accordingly). – makeyourownmaker Jun 23 '20 at 16:43
  • Another problem is that TextCat by itself knows nothing about email encodings. You might prefer to use a mail-aware filter like SpamAssassin, which includes an integrated fork of TextCat, but also takes care of unwrapping any MIME transfer encodings before running it. – tripleee Jun 23 '20 at 16:48
  • I've been running text_cat for a few years. There haven't been any non-english messages classified as english. I've not been rigorous about checking the reverse case. – makeyourownmaker Jun 23 '20 at 17:00
  • A procmail recipe using the SpamAssassin text_cat header would be useful and potentially something I'd accept. – makeyourownmaker Jun 23 '20 at 17:11
  • `formail -A` is not necessary just to evaluate the value of `$LANG`. – xebeche Jun 23 '20 at 21:19
  • @xebeche You're probably correct. The man page suggests `formail -a` would suffice but I've not tested it. – makeyourownmaker Jun 23 '20 at 21:31
  • @makeyourownmaker That's not what I meant. Procmail has the syntax `variablename ??` to test the values of variables. Cf. manpage procmailrc(5). – xebeche Jun 23 '20 at 21:40
  • @xebeche I understand what you're saying now. Feel free to edit the answer. I'd need to run some tests before making any changes. Good point. Thanks. – makeyourownmaker Jun 23 '20 at 21:49
  • Credit should go to tripleee for making the simplifying edit suggested by xebeche. – makeyourownmaker Jun 24 '20 at 09:16

Many modern email clients identify the character set of the email message, though not usually its language. If you want to discard Japanese, Chinese, Korean, and Russian messages, you could try something like

:0HB
* ^Content-type:[ 	]*text/[^/;]*;[ 	]*charset="?(iso-2022|ks-c|gb|koi|cp-1251)
foreign

Because some clients forget to change the character set when they write in English, this is likely to produce some false positives, so I recommend saving to a folder and reviewing it periodically. The opposite problem is harder; many foreign languages use the same character set as English, and thus can't be identified like this with any reliability.

tripleee
  • This is a useful answer (I've upvoted), but given the limitations you list I'm inclined to accept my answer (assuming that is possible and only if there are no better answers within a week or two). – makeyourownmaker Jun 23 '20 at 16:59
  • You can accept your own answer after a while (IIRC two days) and later change the accepted answer if anything better comes up. I agree at this point that yours is the better answer. Adapting it to use SpamAssassin should be reasonably trivial. – tripleee Jun 23 '20 at 17:30
  • I've added some pointers for configuring language categorisation with spamassassin. And I've adapted my earlier procmail recipe to use the spamassassin languages header. – makeyourownmaker Jun 23 '20 at 19:50