3

Say I want to extract the hostname and the port number from a string like this:

stackoverflow.com:443

That is pretty easy. I could do something like this:

(?<host>.*):(?<port>\d*)

I am not worried about protocol schemes, valid host names/IP addresses, or TCP/UDP ports; they are not important to my request.

However, I also need to support one twist that takes this beyond my knowledge of regular expressions - the host name without the port:

stackoverflow.com

I want to use a single regular expression for this, and I want to use named capture groups such that the host group will always exist in a positive match, while the port group exists if and only if we have a colon followed by a number of digits.

I have tried doing a positive lookbehind from my feeble understanding of it:

(?<host>.*)(?<=:)(?<port>\d*)

This comes close, but the colon (:) is included at the end of the host capture. So I tried to change the host to include anything but the colon like this:

(?<host>[^:]*)(?<=:)(?<port>\d*)

That gives me an empty host capture.
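Both results follow from how a lookbehind works: `(?<=:)` only asserts that a colon sits just before the current position, it never consumes the colon. A minimal C# sketch for reproducing the two attempts, assuming .NET's System.Text.RegularExpressions (the class and method names here are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original question.
public static class LookbehindDemo
{
    // Returns the "host" and "port" captures of the first match of pattern in input.
    // A group that did not take part in the match yields an empty string here.
    public static (string Host, string Port) Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return (m.Groups["host"].Value, m.Groups["port"].Value);
    }
}
```

For the input stackoverflow.com:443, calling `LookbehindDemo.Capture` with the first pattern should give a host that still ends in the colon, because `.*` has to swallow the colon before the lookbehind can succeed. With the second pattern, `[^:]*` can never step over the colon, so the only position where the lookbehind holds is right after it, and the match starts there with an empty host.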

Any suggestions on how to accomplish this, i.e. make the colon and the port number optional, but if they are there, include the port number capture and make the colon "vanish"?

Edit: All four answers I have received work well for me, but pay attention to the comments on some of them. I accepted sln's answer because of the nice layout and explanation of the regexp structure. Thanks to all who replied!

Rune Jacobsen
  • EDITED - Not tested, but try this, for example: (?<host>[^:]+)(:(?<port>\d+))? Remember that the question mark itself can be used to make characters or whole groups optional. – Zoltán Tamási Mar 27 '14 at 19:39
  • Jerry: I should have mentioned - this is part of a bigger, more complex regexp that does more than just the host/port stuff. So I just wanted to isolate the part I'm having trouble with. – Rune Jacobsen Mar 27 '14 at 19:47
  • Zoltán: So basically a nested expression? Wow, that takes regular expressions to the next headache level. :) Thanks, will try! – Rune Jacobsen Mar 27 '14 at 19:51
  • @RuneJacobsen, yes, because you want a whole optional group (the colon followed by the port), and want to catch the number part of it, so one group inside another makes sense. – Zoltán Tamási Mar 27 '14 at 19:55
  • @ZoltánTamási: It does indeed make sense. It seems sln's answer is close to what you suggested; if you had posted an answer instead of a comment I would accept that. :) – Rune Jacobsen Mar 27 '14 at 19:58
  • Not really a nested expression, an optional capture group that should be an optional cluster group, especially if you are counting named capture groups and/or named groups last in a larger expression. –  Mar 27 '14 at 20:00
  • @RuneJacobsen I posted an answer. I don't like posting untested code just for hoping it will be good :) – Zoltán Tamási Mar 27 '14 at 20:00
  • @ZoltánTamási: I tested it, and it works for my use. So do the answers from sln and Sabuj - I wish I could accept all three. :S – Rune Jacobsen Mar 27 '14 at 20:01
  • @RuneJacobsen you can definitely upvote all three at least :) Anyway, I don't mind if my answer isn't accepted, I'm glad I could help. – Zoltán Tamási Mar 27 '14 at 20:02

5 Answers

4

I suggest using the Uri class instead of regular expressions.

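// Use the Uri class for parsing only; the "http://" prefix just makes the string a valid URI
var uri = new Uri("http://" + fullAddress);
// Get the host
string host = uri.DnsSafeHost;
// Get the port (80, the http default, if fullAddress has none)
ushort portNum = (ushort)uri.Port;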

The benefits are

  • It supports:
    • IPv4 and IPv6
    • Internationalized domain name (IDN)
  • Can be extended to take the scheme into account in the future
  • Short and standardised code, so fewer mistakes

See a usage sample on .NET Fiddle.
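One thing to watch with this approach is the missing-port case: with an "http://" prefix, Uri.Port reports 80 when the input has no port at all. A minimal sketch of a wrapper that surfaces this, assuming the input is a bare host or host:port string (the class and method names are made up for illustration):

```csharp
using System;

// Hypothetical wrapper for illustration; not taken from the answer's fiddle.
public static class UriHostPort
{
    // Parses "host" or "host:port" by prefixing a placeholder http scheme.
    // IsDefaultPort is true when the port equals the http default (80),
    // which covers both "no port given" and an explicit ":80".
    public static (string Host, int Port, bool IsDefaultPort) Parse(string fullAddress)
    {
        var uri = new Uri("http://" + fullAddress);
        return (uri.DnsSafeHost, uri.Port, uri.IsDefaultPort);
    }
}
```

Note that this cannot tell "stackoverflow.com" apart from "stackoverflow.com:80"; if that distinction matters, one of the regex answers may be a better fit.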

Alex Klaus
  • This question has been added to the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496) as a non-regex alternative, under "Common Tasks > Validation". – aliteralmind Jun 25 '14 at 02:39
  • Didn't consider this since the data I am parsing is not necessarily a Uri, but of course it makes sense that you can do it this way as well. :) – Rune Jacobsen Jul 04 '14 at 12:25
2

Maybe this: (?<host>[^:]+)(?::(?<port>\d+))?

 (?<host> [^:]+ )               # (1), Host, required
 (?:                            # Cluster group start, optional
      :                              # Colon ':'
      (?<port> \d+ )                 # (2), Port number
 )?                             # Cluster group end

Edit: if you used a capture group in place of the cluster group, this is how .NET numbers the groups in its default configuration (unnamed groups first, then named groups):

 (?<host> [^:]+ )         #_(2), Host, required                           
 (                        # (1 start), Unnamed capture group, optional
      :                        # Colon ':'
      (?<port> \d+ )           #_(3), Port number                           
 )?                       # (1 end)
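The pattern and the numbering rule above can be tried out with a short C# sketch. It assumes System.Text.RegularExpressions with RegexOptions.IgnorePatternWhitespace so the commented layout can be used as-is; the class and method names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class HostPortParser
{
    private static readonly Regex Pattern = new Regex(@"
        (?<host> [^:]+ )      # host, required
        (?:                   # cluster group, optional, not captured
            :                 # literal colon
            (?<port> \d+ )    # port digits
        )?",
        RegexOptions.IgnorePatternWhitespace);

    // Returns null when nothing matches; Port is null when no ':digits' follows the host.
    public static (string Host, string Port)? Parse(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success) return null;
        Group port = m.Groups["port"];
        return (m.Groups["host"].Value, port.Success ? port.Value : null);
    }

    // Number that .NET assigns to a named group in an arbitrary pattern.
    public static int GroupNumber(string pattern, string name) =>
        new Regex(pattern).GroupNumberFromName(name);
}
```

Checking `Group.Success` rather than the captured value is what distinguishes "no port present" from an empty capture. `HostPortParser.GroupNumber` can be used to compare the cluster-group and capture-group variants of the pattern.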
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Common Validation Tasks". – aliteralmind Apr 10 '14 at 01:18
  • @aliteralmind, please consider this answer (http://stackoverflow.com/a/24399003/968003) for that FAQ instead. – Alex Klaus Jun 25 '14 at 01:57
1

If your host name doesn't contain a colon (as an IPv6 address would), then try this one:

(?<host>[^:]*):?(?<port>\d*)
Sabuj Hassan
  • This would match "stackoverflow.com8080", wouldn't it? – Zoltán Tamási Mar 27 '14 at 19:50
  • @ZoltánTamási But OP says `not worried about protocol schemes or valid host names` – Sabuj Hassan Mar 27 '14 at 19:51
  • I thought the colon between hostname and port is one level lower than valid host names and protocol schemas :) – Zoltán Tamási Mar 27 '14 at 19:53
  • Zoltán is right, it would match this, but Sabuj is also right - for this regexp, I want to parse this as well as possible, given potentially malformed input. In other regexps at other points in the code I will validate and warn about illegal/wrong input. – Rune Jacobsen Mar 27 '14 at 19:53
1

Try this:

(?<host>[^:]+)(:(?<port>\d+))?

This makes the whole colon-and-port part an optional group and captures the port number inside it. Also, I used the plus sign to ensure that the host name and the port number each contain at least one character.

Zoltán Tamási
1

You can use this:

(?<host>[^:]+)(:(?<port>\\d+))?
brz
  • This works, but could you possibly explain the reason for the two backslashes in front of the d? I.e. I understand that \d represents a digit. The difference between one and two backslashes seems to be the number of capture groups returned. – Rune Jacobsen Mar 27 '14 at 20:36
  • It's for escaping the backslash in C# strings. It shouldn't be there in this context, but in a normal C# string you have to escape it, as you know. – brz Mar 27 '14 at 20:38
  • @user3246354, a regex should almost always be declared as a verbatim string (using the at sign), so you don't need to worry about escaping the backslashes. Usually a regex is complex enough without that too. – Zoltán Tamási Mar 27 '14 at 20:46
  • Yes, it was my mistake. – brz Mar 27 '14 at 20:47
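The escaping point from the comments above can be sketched in C#. This is a minimal illustration, assuming System.Text.RegularExpressions; the class, constant, and method names are made up:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class VerbatimDemo
{
    // Regular string literal: the backslash itself must be escaped.
    public const string Escaped = "(?<host>[^:]+)(:(?<port>\\d+))?";

    // Verbatim string literal: backslashes are taken as written.
    public const string Verbatim = @"(?<host>[^:]+)(:(?<port>\d+))?";

    // Returns the port capture, or null when the port group did not take part in the match.
    public static string Port(string pattern, string input)
    {
        Group port = Regex.Match(input, pattern).Groups["port"];
        return port.Success ? port.Value : null;
    }
}
```

The two constants hold the same pattern text. If the doubled backslash ends up in the pattern itself (for example by writing `\\d` inside a verbatim string or pasting it into a regex tester), the engine reads it as a literal backslash followed by the letter d, so the optional group no longer matches ":443" and the port group is left unset. That is probably the difference in returned groups observed in the first comment.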