
I know the basics of Java but I'm not very experienced with regex or patterns, so please excuse me if I'm asking something simple. I'm writing a method that detects IP addresses and host names. I used the regex from this answer here. The problem I'm encountering is that sentences without symbols are counted as host names.

Here's my code:

    Pattern validHostname = Pattern.compile("^(([a-z]|[a-z][a-z0-9-]*[a-z0-9]).)*([a-z]|[a-z][a-z0-9-]*[a-z0-9])$",Pattern.CASE_INSENSITIVE);
    Pattern validIpAddress = Pattern.compile("^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])([:]\\d\\d*\\d*\\d*\\d*)*$",Pattern.CASE_INSENSITIVE);
    String msg = c.getMessage();
    boolean found=false;

    //Randomly picks from a list to replace the detected ip/hostname
    int rand=(int)(Math.random()*whitelisted.size());
    String replace=whitelisted.get(rand);

    Matcher matchIP = validIpAddress.matcher(msg);
    Matcher matchHost = validHostname.matcher(msg);

    while(matchIP.find()){
        if(adreplace)
            msg=msg.replace(matchIP.group(),replace);
        else
            msg=msg.replace(matchIP.group(),"");

        found=true;
        c.setMessage(msg);
    }
    while(matchHost.find()){
        if(adreplace)
            msg=msg.replace(matchHost.group(),replace);
        else
            msg=msg.replace(matchHost.group(),"");

        found=true;
        c.setMessage(msg);
    }
    return c;
Zach

1 Answer


Description

Without sample text and desired output, I'll try my best to answer your question.

I would rewrite your host name expression like this:

A: ^(?:[a-z][a-z0-9-]*[a-z0-9](?=\.[a-z]|$)\.?)+$ allows single-word names like abcdefg.

B: ^(?=(?:.*?\.){2})(?:[a-z][a-z0-9-]*[a-z0-9](?=\.[a-z]|$)\.?)+$ requires the string to contain at least two periods, like abc.defg.com. It does not allow a period at the beginning or end, or consecutive periods. The {2} inside the lookahead is the minimum number of dots that must appear; you can change this number as you see fit.


  • ^ anchors the match at the start of the string
  • (?: starts a non-capture group, which performs slightly better than a capture group
  • [a-z][a-z0-9-]*[a-z0-9] matches one label, taken from your original expression (note that it needs at least two characters)
  • (?=\.[a-z]|$) looks ahead to check that the next character is a dot followed by an a-z character, or that the string ends here
  • \.? consumes a single dot if one exists
  • ) closes the non-capture group
  • + requires the contents of the group to occur one or more times
  • $ anchors the match at the end of the string

Host names:
A allows a host name without dots
B requires the host name to contain dots (at least two as written; lower the {2} to require fewer)
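
To see how the two expressions behave, here is a minimal sketch in Java. The class name HostnamePatternDemo, the helper methods matchesA and matchesB, and the sample strings are made up for illustration; only the two expressions come from the answer above. Note the doubled backslashes that Java string literals need.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the two regular expressions come from the answer.
class HostnamePatternDemo {
    // Expression A: one or more labels, dots optional.
    static final Pattern HOST_A = Pattern.compile(
            "^(?:[a-z][a-z0-9-]*[a-z0-9](?=\\.[a-z]|$)\\.?)+$",
            Pattern.CASE_INSENSITIVE);

    // Expression B: same as A, plus a lookahead that demands at least two dots.
    static final Pattern HOST_B = Pattern.compile(
            "^(?=(?:.*?\\.){2})(?:[a-z][a-z0-9-]*[a-z0-9](?=\\.[a-z]|$)\\.?)+$",
            Pattern.CASE_INSENSITIVE);

    static boolean matchesA(String s) {
        return HOST_A.matcher(s).matches();
    }

    static boolean matchesB(String s) {
        return HOST_B.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Sample inputs chosen for illustration only.
        String[] samples = {
            "abcdefg", "abc.defg.com", "hello world", ".abc.com", "abc..com", "abc.com."
        };
        for (String s : samples) {
            System.out.printf("%-14s A=%b B=%b%n", s, matchesA(s), matchesB(s));
        }
    }
}
```

With these inputs, abcdefg should match only A, abc.defg.com should match both, and the sentence with a space, the leading dot, the doubled dot, and the trailing dot should match neither.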

Live Demo with a sentence with no symbols

I would also rewrite the IP expression:

^(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?::\d*)?$

The major differences here are that I:

  • collapsed the repeated \d* at the end into a single \d*, because consecutive \d* quantifiers are equivalent to one (as written this also accepts a bare trailing colon; use \d+ if at least one port digit is required)
  • changed the character class [:] to a single character :
  • turned the capture groups (...) into non-capture groups (?:...), which perform a little better
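
The rewritten IP expression can be tried out with a short Java sketch like the one below. The class name IpPatternDemo, the OCTET constant, the isIp helper, and the sample inputs are assumptions for illustration; splitting the octet into a constant is only a readability choice and produces the same expression as the one above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the rewritten IP expression from the answer.
class IpPatternDemo {
    // One octet in the range 0-255, as a non-capture group.
    static final String OCTET = "(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])";

    // Three "octet." prefixes, a final octet, then an optional ":port" suffix.
    static final Pattern IP = Pattern.compile(
            "^(?:" + OCTET + "\\.){3}" + OCTET + "(?::\\d*)?$");

    static boolean isIp(String s) {
        return IP.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Sample inputs chosen for illustration only.
        String[] samples = {"192.168.0.1", "10.0.0.1:25565", "256.1.1.1", "1.2.3", "1.2.3.4:"};
        for (String s : samples) {
            System.out.printf("%-16s %b%n", s, isIp(s));
        }
    }
}
```

Because the port part is \d*, the last sample (an address followed by a bare colon) is accepted too; swapping in \d+ would reject it.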
Ro Yo Mi
  • This helps so much! The only thing I can spot that's off is that the regex catches single words with no special characters, e.g. "hello". – Zach Aug 22 '13 at 20:09
  • What are your requirements for a host name? Should it be a FQDN which must contain a period? – Ro Yo Mi Aug 23 '13 at 00:43
  • The host name should at least have "SiteName.com" in it, but I would like things like www. to count as well. And yes, just periods. I'm not really concerned about IPv6. – Zach Aug 23 '13 at 01:49
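
Given the requirement stated in the comments (at least one dot, as in SiteName.com), one possible approach is expression B with the lookahead count lowered to {1}, applied to each space-separated word of the message rather than to the whole message. The sketch below is only an illustration under that assumption: the class name ChatHostFilterDemo, the censor method, and the "[removed]" replacement are hypothetical and not part of the original code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: checks each space-separated word against expression B with {1}.
class ChatHostFilterDemo {
    // Expression B from the answer, with the minimum number of dots lowered to one.
    static final Pattern HOST_WITH_DOT = Pattern.compile(
            "^(?=(?:.*?\\.){1})(?:[a-z][a-z0-9-]*[a-z0-9](?=\\.[a-z]|$)\\.?)+$",
            Pattern.CASE_INSENSITIVE);

    // Returns the message with every word that looks like a host name replaced.
    static String censor(String message, String replacement) {
        // Limit -1 keeps empty tokens so the original spacing survives the round trip.
        String[] tokens = message.split(" ", -1);
        for (int i = 0; i < tokens.length; i++) {
            if (HOST_WITH_DOT.matcher(tokens[i]).matches()) {
                tokens[i] = replacement;
            }
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(censor("join SiteName.com now", "[removed]"));
        System.out.println(censor("hello world", "[removed]"));
    }
}
```

This is a rough filter: a word with attached punctuation such as "SiteName.com," is not detected, and ordinary text written without a space after a period, such as "ok.then", is treated as a host name.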