17

I'm writing some code that processes URLs, and I want to make sure I'm not leaving some strange case out...

Are there any valid characters for a host other than: A-Z, 0-9, "-" and "."?

(This includes anything that can be in subdomains, etc. Essentially, anything between :// and the first /)

Thanks!

Daniel Magliola
  • Given that you are looking for "anything between :// and the first /", don't forget that you may have a port number in there too, as in http(s)://my.host.com:8080/... – fredw May 02 '12 at 17:38
  • See this question for regex https://stackoverflow.com/questions/106179/regular-expression-to-match-dns-hostname-or-ip-address/3824105#3824105 – What Would Be Cool Jan 15 '21 at 22:06
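As a small illustration of fredw's comment about ports: if the code is in Python, the standard library can separate the host from the port before any character validation is done. This is only a sketch; the helper name `host_and_port` is made up for this example.

```python
from urllib.parse import urlsplit

def host_and_port(url):
    """Return (hostname, port) for a URL; port is None when none is given."""
    parts = urlsplit(url)
    # parts.netloc is everything between "://" and the first "/", so it can
    # still contain userinfo and a port. parts.hostname strips both and
    # lowercases the result.
    return parts.hostname, parts.port

print(host_and_port("https://my.host.com:8080/path"))  # ('my.host.com', 8080)
```

Validating `parts.hostname` rather than the raw text between `://` and `/` keeps the `:8080` part out of the character check.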

6 Answers

31

Please see Restrictions on valid host names:

Hostnames are composed of series of labels concatenated with dots, as are all domain names. For example, "en.wikipedia.org" is a hostname. Each label must be between 1 and 63 characters long, and the entire hostname has a maximum of 255 characters.

RFCs mandate that a hostname's labels may contain only the ASCII letters 'a' through 'z' (case-insensitive), the digits '0' through '9', and the hyphen. Hostname labels cannot begin or end with a hyphen. No other symbols, punctuation characters, or blank spaces are permitted.
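The rules quoted above can be sketched as a small Python check. This is a minimal sketch rather than a reference implementation: the function name `is_valid_hostname` is hypothetical, the 255-character limit is taken from the quoted text, and accepting a single trailing dot is an assumption of this example.

```python
import re

# One label: 1 to 63 characters, letters/digits/hyphen only,
# and no hyphen at the start or at the end.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

def is_valid_hostname(hostname):
    """Check a hostname against the letters-digits-hyphen rules quoted above."""
    # Assumption: tolerate one trailing dot (fully qualified form).
    if hostname.endswith("."):
        hostname = hostname[:-1]
    # Overall length limit as stated in the quoted text.
    if not hostname or len(hostname) > 255:
        return False
    # Every dot-separated label must match; empty labels ("a..b") fail.
    return all(_LABEL.fullmatch(label) for label in hostname.split("."))

print(is_valid_hostname("en.wikipedia.org"))  # True
print(is_valid_hostname("-one.two.com"))      # False
```

Note that this only covers the ASCII form of a hostname; it deliberately rejects underscores and non-ASCII input.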

Andrew Hare
4

No, that is all that is allowed.

Here is a reference if you would like to read more: http://www.ietf.org/rfc/rfc1034.txt

Russ Bradberry
  • Those are old rules that were changed later, so one should not rely on them. See RFC 1123 for an example of a change to one of those rules. – Patrick Mevzek Feb 12 '20 at 16:48
4

It depends on the level at which you do the validation (before or after the URL escaping). If you are validating user input, it can go way beyond ASCII (with big chunks of Unicode).

See http://en.wikipedia.org/wiki/Internationalized_domain_name

If you try to validate after all the escaping and the "punycode" is done, there is no point in validation, since that is already guaranteed to only contain valid characters by the old RFC.
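To make the "before" and "after" forms concrete, here is a short Python sketch that converts a user-typed internationalized name to its ASCII ("punycode") form. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so newer IDNA 2008 behaviour may differ; the helper name `to_ascii_host` is made up for this example.

```python
def to_ascii_host(host):
    """Convert a possibly non-ASCII host name to its ASCII (xn--) form."""
    # The built-in "idna" codec encodes each dot-separated label separately
    # and leaves labels that are already plain ASCII unchanged.
    return host.encode("idna").decode("ascii")

print(to_ascii_host("bücher.example"))  # xn--bcher-kva.example
```

The input on the left is what a user might type; the output on the right contains only letters, digits, hyphens and dots.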

Mihai Nita
  • Hmmmmm, good point, I need to look into this to see whether it applies to me or not. I'm not exactly sure what you mean by before or after the escaping, and I'm not exactly sure how it applies to my particular situation (which is a bit weird). I'll have to experiment with this, thank you! – Daniel Magliola Jul 16 '09 at 13:04
  • What I mean by "before or after escape": "before escape" is the stuff the user types, which can use things that the "after escape" URL cannot (for instance =/&?). "After escape" is the URL as used by low-level DNS/HTTP/whatever (%3D%2F%26%3F). For international characters, that "escaping" is more complex than "just replace invalid characters with %xx". – Mihai Nita Jul 19 '09 at 00:20
1

Keep in mind that besides the hostname rules of the Internet, DNS systems are free to create any names that they like. DNS servers could accept and reply to 8-bit binary requests: the DNS wire protocol does not forbid it.

This means that for internal LAN URLs you may have different rules, such as the underscore appearing in a host name.

Zan Lynx
  • That's a good thing to keep in mind. I like to ensure I always have a service that users can use if they reverse the hostname, encode it as utf-16, and then break it up into an array of 32-bit (big-endian) ints, append the number of bytes in the last int, reverse the int array and send as JSON: `[2, 26368, 1862299392, 1811965696, 771777792, 1862296320, 4294864128]`. It's super-effective, and I haven't had a single complaint about the service. – Mr. B Sep 27 '18 at 00:47
  • ..by which I mean, "Huh. I didn't know that. But that could be very useful in the future as we continue to define different symbol systems like with unicode." – Mr. B Sep 27 '18 at 00:50
  • "such as the underscore appearing in a host name." No, underscores can never appear in a hostname. They can appear in a "domain name" per the RFC 1034/1035 definitions because, as you correctly said, the DNS is 8-bit clean, but some records, like A/AAAA, restrict what they can be used for: they can be used for hostnames, not domain names, and hostnames are LDH (letters, digits, hyphens) only. – Patrick Mevzek Feb 12 '20 at 16:49
  • @PatrickMevzek I assure you, I can make a DNS server return anything I want in an A record. Or rather I should say, I can make it return an A record for any name lookup. – Zan Lynx Feb 12 '20 at 19:01
  • "I can make a DNS server return anything I want in an A record." Certainly not a compliant one then. The RDATA of an `A` record is an IP address. Nothing else. "Or rather I should say, I can make it return an A record for any name lookup" Then you are not conforming to RFC1034/1035. Hence you are not doing "DNS" but your own protocol... And you won't interoperate with any other DNS software. But yes everyone is free to do anything in private, even not following standards... And it works! In private. Not when there is a need of interoperability, this is why there are standards. – Patrick Mevzek Feb 12 '20 at 20:34
  • @PatrickMevzek You might be interested to know that standard TCP/IP tools like Ping and Telnet pass any name straight to the system resolver. And that resolver passes it to DNS. Even if the name does not follow the Internet host rules. Because LANs have always been little worlds of their own that do not follow Internet standards. – Zan Lynx Feb 15 '20 at 21:29
  • Once you speak about what you send, and once about the reply. You ought to decide at some point what exactly you are talking about. Even if you dislike it or are sure you can do things differently, the standard says an `A` record maps a hostname to an IPv4 address, and anything not doing that is just not following the standard, and hence is not interoperable. That is all. An underscore can indeed appear in a name, even in the DNS, but not in a hostname. See RFC 1034 and 1035 for definitions. Of the DNS standard. – Patrick Mevzek Feb 15 '20 at 22:56
1

A valid URL host consists of ASCII letters, digits, the dot ( . ) and the hyphen ( - ). The whole host has a maximum length of 255 characters and is made up of dot-separated labels of at most 63 characters each. A hyphen can separate alphanumeric sequences, e.g. one-two.net, but cannot appear at the beginning or end of a dot-separated label: -one.two.com, one.two.com- and one-.two.com are all invalid hosts.

See https://tools.ietf.org/html/rfc1123#page-79 and Assumptions part 1 of https://tools.ietf.org/html/rfc952

Also this is a link to an online regex tool to validate URL host which worked as of 5/28/2019 https://www.regextester.com/23

Also, per https://tools.ietf.org/html/rfc1123#page-13, when validating a host you should check it syntactically for a dotted-decimal number (an IP address) before looking it up in the DNS.
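One possible way to do that dotted-decimal check in Python is to let the standard `ipaddress` module decide. This is a sketch under assumptions: the helper name `is_ipv4_literal` is hypothetical, and the module is stricter than a purely syntactic test, since it also rejects out-of-range octets such as 256.

```python
import ipaddress

def is_ipv4_literal(host):
    """Return True if host is an IPv4 address in dotted-decimal form."""
    try:
        # Raises a ValueError subclass for anything that is not a valid
        # four-octet dotted-decimal address.
        ipaddress.IPv4Address(host)
    except ValueError:
        return False
    return True

print(is_ipv4_literal("192.168.0.1"))  # True
print(is_ipv4_literal("example.com"))  # False
```

A host that passes this check can be treated as an address directly, and only the remaining hosts need the hostname rules and a DNS lookup.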

shane
0

If you want to write URL-parsing code that perfectly matches the official W3C spec, see the document at www.w3.org/TR/url-1/ . See section 3 (Hosts) for specific information on hosts in URLs.

Chad