Is the "&" symbol allowed in the path segment of a URL, or should it be escaped?

According to the Nu W3C validator (https://validator.w3.org/nu/), I get:

Error: & did not start a character reference. (& probably should have been escaped as &amp;.)
At line 407, column 52
<a href="/Bags-&-Purses/c/wome

However, when I build the URL with the Java URI class, spaces and similar characters are encoded, but the & symbol is not.

URI u = new URI(request.getScheme(), null,
                            request.getServerName(), request.getServerPort(),
                            request.getContextPath() + url,
                            query, null);
u.toURL().toString();

Here the url string was: /Bags-&-Purses/c/womens-accessories-bags

The result is: https://localhost:8112/storefront/Bags-&-Purses/c/womens-accessories-bags (the & is not encoded).

Why is the & not escaped? Is this valid? I expected it to be escaped as %26, but that does not happen.

JOKe

1 Answer

&, while a reserved character, is a valid character in the path segment of a URI. If you look at the grammar given for the path component in RFC 3986, section 3.3, & is allowed as part of the sub-delims group:

  path          = path-abempty    ; begins with "/" or is empty
                / path-absolute   ; begins with "/" but not "//"
                / path-noscheme   ; begins with a non-colon segment
                / path-rootless   ; begins with a segment
                / path-empty      ; zero characters

  path-abempty  = *( "/" segment )
  path-absolute = "/" [ segment-nz *( "/" segment ) ]
  path-noscheme = segment-nz-nc *( "/" segment )
  path-rootless = segment-nz *( "/" segment )
  path-empty    = 0<pchar>

  segment       = *pchar
  segment-nz    = 1*pchar
  segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                ; non-zero-length segment without any colon ":"

  pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

(...)

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

You are asking about URLs rather than the more general URIs, but as far as I can tell a URL places no extra restrictions on the path segment. Section 2.2 of the same RFC states that reserved characters should be percent-encoded unless they are specifically allowed in that component. In this case, every character in the sub-delims group (& included) is allowed in a path segment by the grammar above, so the & does not have to be written as %26 there.
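The behaviour described in the question can be reproduced with a small sketch. It assumes the same seven-argument java.net.URI constructor as in the question; the class name PathAmpersandDemo, the helper build, and the second sample path are made up for illustration. The multi-argument constructors quote characters that are illegal in a path, such as spaces, but leave & alone because it is legal there.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PathAmpersandDemo {
    // Hypothetical helper: builds a URI from a raw path using the
    // seven-argument constructor (scheme, userInfo, host, port, path, query, fragment).
    static String build(String path) {
        try {
            URI u = new URI("https", null, "localhost", 8112, path, null, null);
            return u.toASCIIString();
        } catch (URISyntaxException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // '&' is a legal path character, so it is left as-is.
        System.out.println(build("/storefront/Bags-&-Purses/c/womens-accessories-bags"));
        // A space is not legal in a path, so it is quoted as %20.
        System.out.println(build("/storefront/Bags & Purses"));
    }
}
```

Under these assumptions the first line printed still contains a literal &, while the second contains %20 in place of each space.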

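However, the issue you are having is not with the URL itself but with its textual representation inside an HTML document. The validator complains because a bare & in HTML markup is read as the start of a character reference, so inside the href attribute it should be written as &amp; (for example, href="/Bags-&amp;-Purses/c/womens-accessories-bags"). The browser decodes &amp; back to & before requesting the URL. Related question: Do I really need to encode '&' as '&amp;'?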
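As a rough sketch of that second step, the URL can be HTML-escaped at the point where it is written into the markup. The class HrefEscapeDemo and the method escapeHtmlAttribute are hypothetical names for illustration; in a real project a template engine or an existing escaping utility would normally do this instead.

```java
class HrefEscapeDemo {
    // Hypothetical helper: escapes the characters that are significant
    // inside a double-quoted HTML attribute value.
    static String escapeHtmlAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The URL itself is valid; only its HTML representation needs escaping.
        String url = "/Bags-&-Purses/c/womens-accessories-bags";
        System.out.println("<a href=\"" + escapeHtmlAttribute(url) + "\">Bags</a>");
    }
}
```

The URL is left untouched as a URL; only the copy embedded in the HTML attribute has its & written as &amp;.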

mpontes