
I can't really code in Perl, so what seems like a simple thing, writing a regex to score all URIs whose TLD is not "com", "net", or "org", is apparently beyond my skills. Could someone kindly enlighten me?

As an example, I want https://foo.com.us/asdf?qwerty=123 to match and ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2 not to match.

justinzane

2 Answers


The regex pattern

//(?:[a-z]+\.)*+(?!com/|net/|org/)

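should do what you want. The slashes are part of the pattern, not delimiters. The possessive `*+` stops the engine from backtracking into an earlier label of the host name, so the lookahead only ever tests the final label.

Here's a demonstration: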

use strict;
use warnings;
use 5.010;

my @urls = qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
};

for ( @urls ) {
    say m{//(?:[a-z]+\.)*+(?!com/|net/|org/)} ? 'match' : 'no match';
}

Output:

match
no match
Borodin
  • I had tried ...`(?!com|net|org)\/` without success. [Though typo failure is possible. :)] Is there a rationale for including the final slash within the "or" group? – justinzane Jun 26 '15 at 15:47
  • @justinzane: Without that slash, the regex would reject a name where the TLD *starts* with any of those strings. For instance, `www.batman.comics` would fail the test. Actually, I've rewritten it a little so that the URL doesn't have to have a path – Borodin Jun 26 '15 at 16:09
  • Nice use of [possessive greed](http://www.rexegg.com/regex-quantifiers.html#possessive)! Note that the URL may not have a path, or there might be a port or query instead of a path. I'd recommend the negative lookahead be `(?!(?:com|net|org)\b(?![.-]))`, which includes a double negative to avoid a word break that is followed by a character allowed in a host name. Also, the character class should be `[\w-]` to allow numbers and dashes. – Adam Katz Mar 09 '16 at 21:51
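
The suggestions in the comments above can be combined into one pattern. The following is only a sketch under a few assumptions, not a tested SpamAssassin rule: the helper name `is_unusual_tld`, the extra sample URLs, and the `/i` flag (so that upper-case host names are treated the same way) are illustrative additions that do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the host name's final label is
# something other than com, net, or org.
# - [\w-] allows digits and dashes in host labels (per the comment above)
# - *+ is possessive, so the lookahead only sees the final label
# - \b(?![.-]) ends the TLD at a word break that is not followed by a
#   character that could continue the host name
# - /i is an assumption added here so that upper-case hosts behave the same
sub is_unusual_tld {
    my ($url) = @_;
    return $url =~ m{//(?:[\w-]+\.)*+(?!(?:com|net|org)\b(?![.-]))}i ? 1 : 0;
}

my @urls = qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
    http://www.batman.comics/
    http://example.org
    http://example.com:8080/index.html
    http://example.net?x=1
};

for my $url (@urls) {
    printf "%-55s %s\n", $url, is_unusual_tld($url) ? 'match' : 'no match';
}
```

With this pattern, only the `foo.com.us` and `www.batman.comics` URLs should be reported as matches. The URLs with no path, with a port, or with a query directly after the host should not match. Note that the pattern starts from any `//` in the string, so a URL that embeds a second URL in its query string could still match on the embedded host.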

You should use the URI module to separate the host name from the rest of the URL.

This example extracts only the final label of the host name, so it will look at, say, uk from bbc.co.uk, but it should serve your purpose.

use strict;
use warnings;

use URI;

my @urls = qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
};

for my $url ( @urls ) {
    my $uri  = URI->new($url);        # parse without modifying @urls
    my $host = $uri->host;
    my ($tld) = $host =~ /([^.]+)\z/; # final label of the host name

    if ( $tld !~ /^(?:com|net|org)\z/ ) {
        # non-standard TLD
    }
}
Borodin
  • Since this is going in a SpamAssassin `local.cf` rule, I need a basic Perl regex without loading any other modules. The reason, I believe, is that the regex must be usable after being processed by re2c and compiled for "spamd". – justinzane Jun 26 '15 at 15:44