9

We are currently working on an I18N project. I am wondering what the complications are of having non-ASCII characters in the URL. If it's not advisable, what are the alternatives for dealing with this problem?

EDIT (in response to Maxym's answer): The site is going to be local to a specific country, and I need not worry about the worldwide public accessing it. I understand that from a usability point of view it is really annoying. What other technical problems are associated with this?

rkg
  • Well, if you are working on an i18n site, that means you will translate your site into a few languages... which means you expect people from other countries, which makes your site not that local :) Of course I could be wrong, if people in your country use a few languages specific to your country only – Maxym Jan 13 '11 at 17:57
  • If you use non-ASCII characters, how will people type them with a standard keyboard? – wenn32 Jan 13 '11 at 17:41
  • Their standard keyboard handles it. – Broam Jan 13 '11 at 17:55

4 Answers

7

It is possible to use non-ASCII/non-Latin domain names using IDNA. Further, you can always use percent-encoding (like %20 for a space) in URLs. RFC 3986 recommends UTF-8 encoding combined with percent-encoding:

the data should first be encoded as octets according to the UTF-8 character encoding; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. (...) For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".

Modern clients (web browsers) are able to transform back and forth between percent encoding and Unicode, so the URL is transferred as ASCII but looks pretty for the user.

Make sure you're using a web framework/CMS that understands this encoding as well, to simplify URL input from webmasters/content editors.
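As a minimal sketch of the encoding described above, assuming Python's standard library (the language and the helper name `to_ascii_url` are choices made for this illustration, not something the answer prescribes): the path is percent-encoded as UTF-8, and the host name is converted with the built-in `idna` codec, which implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, unquote

def to_ascii_url(host: str, path: str) -> str:
    """Hypothetical helper: build an all-ASCII http URL from Unicode parts."""
    # Host names go through IDNA (Punycode labels that start with "xn--").
    ascii_host = host.encode("idna").decode("ascii")
    # Paths are encoded as UTF-8 octets, then percent-encoded; "/" is kept.
    ascii_path = quote(path, safe="/")
    return f"http://{ascii_host}{ascii_path}"

# The RFC 3986 examples quoted above:
print(quote("A"))    # A
print(quote("À"))    # %C3%80
print(quote("ア"))   # %E3%82%A2

# Decoding restores the original character, as browsers do for display.
decoded = unquote("%E3%82%A2")

url = to_ascii_url("nürnberg.de", "/straße")
print(url)
```

The resulting string contains only ASCII, so it can be sent over the wire or stored as-is, while `unquote` recovers the readable form.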

Fred Foo
  • Is it possible to do this conversion inside JavaScript? Does it have a built in functionality for this? – Shayan Nov 29 '19 at 12:46
4

I would say no. The reason is simple: if you rely on a worldwide audience, it would be a big problem for people to type your URL. I live in the "Cyrillic" world. It is possible to create Cyrillic URLs, but no one has succeeded with that, because even we are too lazy to switch the keyboard language and have gotten used to typing Latin...

Update:
I can't say much about alternatives, but some languages have an informal or formal letter substitution, e.g. in German you write Ö but in a URL you may see OE instead. You can also consider English words, or words with similar sounds (so people from your country can remember the spelling, and it does no harm to other "countries").
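The letter-substitution idea can be sketched as follows. This is only an illustration under stated assumptions: the mapping table and the function names `transliterate` and `url_slug` are hypothetical, and the table covers German only.

```python
# Hypothetical mapping for German; other languages need their own table.
GERMAN_SUBSTITUTES = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}

def transliterate(text: str) -> str:
    """Replace each mapped character; leave everything else untouched."""
    return "".join(GERMAN_SUBSTITUTES.get(ch, ch) for ch in text)

def url_slug(text: str) -> str:
    """Lower-case the transliterated text and turn spaces into hyphens."""
    return transliterate(text).lower().replace(" ", "-")

slug = url_slug("Nürnberg Straße")
print(slug)  # nuernberg-strasse
```

Characters that are not in the table pass through unchanged, so a real site would still need percent-encoding as a fallback for anything unmapped.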

Maxym
  • @maxym what if he wants only Russians to see the site? – JOE SKEET Jan 13 '11 at 17:35
  • @herrow: in Russian you can use transliteration, meaning Cyrillic words written with Latin letters (sometimes even digits are used). Many people use it here when chatting (terrible to read, but they can't help doing it) – Maxym Jan 13 '11 at 17:43
  • Thanks Maxym! I am not worried about the world wide public, the site is going to be pretty local to a country. – rkg Jan 13 '11 at 17:45
  • @Ravi, sometimes we travel :) And when I travel, 99% of the time I have no access to a Cyrillic keyboard. Also, thinking "my site is pretty local" is not thinking about the future. Today you think so, tomorrow you will achieve more :) Be simple and flexible! – Maxym Jan 13 '11 at 17:47
  • Sorry, just completing the sentence from my previous comment: "Today you think so, tomorrow you will _be willing to_ achieve more" – Maxym Jan 13 '11 at 17:55
  • @maxym what if the Russian doesn't know English letters? – JOE SKEET Jan 13 '11 at 19:46
  • @JOE SKEET (you've changed your name from herrow?) Well, I'm Ukrainian, and I don't know anybody who does not know Latin letters, and I'm counting even 6-year-old children. In school we study either English or German or French, sometimes even a few of them. The same goes for Russia... And we have a lot of stuff from abroad, so children are always interested in how to read its name etc. I'm sure you have never been to Ukraine, Russia, Belorussia etc ;) – Maxym Jan 13 '11 at 19:59
  • @maxym if you are in Cambodia, you are not learning English, you are learning how to stay alive – JOE SKEET Jan 13 '11 at 22:54
2

It depends on the target users. For example, Nürnberg.de is also reachable as nuernberg.de to keep it easily accessible, although native German users have no trouble with the original (the German keyboard layout is their default and has keys for all four extra characters, öäüß, available to every German speaker). Do not forget that one of the goals of I18N is to give the end user a native-language feel. Mac and Linux users have even more intuitive ways to enter such characters: on a Mac, for instance, pressing Alt+u puts an umlaut on the next character.

I was just wondering what are the complications of having the non-ascii characters in the URL.

But the way you phrased it, your question seems to be about URIs in general rather than just URLs, that is, about putting non-ASCII characters into the path part of a URI. There are no real complications in that, as long as you know where and how to parse that part on the server (for example, on a Django-based server it can be matched and handled with regexes in urls.py). All you need to keep in mind is that with the Web 2.0 (Ajax/JavaScript) evolution nearly everything runs on UTF-8: JavaScript's URI functions such as encodeURIComponent always percent-encode as UTF-8. UTF-8 has thus evolved into a de facto standard. Stick with UTF-8 and you will hardly face any complications in parsing and handling URIs.

For example, check the URIs http://de.wikipedia.org/wiki/Fürth or http://hi.wikipedia.org/wiki/जर्मनी. However you type them into the address bar, the browser converts them to percent-encoded UTF-8 and sends that to the server.
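A rough server-side sketch of this round trip, using only Python's standard library rather than Django itself. The route pattern `WIKI_ROUTE` and the function `title_from_url` are hypothetical names invented for this illustration; a real framework would do the decoding and matching for you.

```python
import re
from urllib.parse import quote, unquote, urlsplit

# Hypothetical route: /wiki/<title>, where the title is any run of
# characters other than "/".
WIKI_ROUTE = re.compile(r"/wiki/(?P<title>[^/]+)")

def title_from_url(url: str):
    """Return the decoded article title, or None if the URL is not a wiki page."""
    raw_path = urlsplit(url).path   # still percent-encoded ASCII
    path = unquote(raw_path)        # unquote decodes as UTF-8 by default
    match = WIKI_ROUTE.fullmatch(path)
    return match.group("title") if match else None

# Roughly what a browser sends for the first example URL:
wire_url = "http://de.wikipedia.org" + quote("/wiki/Fürth")
print(wire_url)  # http://de.wikipedia.org/wiki/F%C3%BCrth

title = title_from_url(wire_url)
hindi_title = title_from_url("http://hi.wikipedia.org" + quote("/wiki/जर्मनी"))
```

The pattern uses `[^/]+` instead of `\w+` on purpose: in Python, `\w` does not match combining marks such as the vowel signs in the Hindi title, so a `\w+` route would reject it.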

NOTE: besides non-ASCII characters, some ASCII symbols (such as the space and other reserved characters) are also written using percent-encoding. More about it can be found here:

http://en.wikipedia.org/wiki/Percent-encoding

P M
1

You can use non-ASCII characters in a URL, but it's ugly because special characters must be percent-encoded, as described here: http://www.w3schools.com/tags/ref_urlencode.asp