How to get all unicode characters of a language by using its ISO language code in Python?

Question

For example, the ISO language code for German is de.
How do I get all the Unicode characters used by that language in Python?

If that's not directly possible, how about the following two steps?

First, given an ISO language code (say de), how do I find the name of the script it uses?

(For example, the script used for German is Latin.)

>>> import unicodedata as ud
>>> ud.name('ß')
'LATIN SMALL LETTER SHARP S'

Then, using this script name, how do I get all the Unicode characters of that script?
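
For reference, a rough sketch of that second step using only the standard library. As far as I know, `unicodedata` does not expose the Unicode Script property, so this filters code points by the first word of their character name instead. That is only a heuristic: for example, it also picks up U+271D LATIN CROSS, which is a symbol rather than a letter. The helper name `chars_with_name_prefix` is made up for illustration.

```python
import sys
import unicodedata as ud


def chars_with_name_prefix(prefix):
    """Return all characters whose Unicode name starts with `prefix`.

    Heuristic only: the character name is used as a stand-in for the
    Script property, which the standard library does not provide.
    """
    prefix = prefix.upper() + " "
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        # Unnamed code points (unassigned, surrogates, ...) fall back to "".
        if ud.name(chr(cp), "").startswith(prefix)
    ]


latin = chars_with_name_prefix("latin")
devanagari = chars_with_name_prefix("devanagari")
```

This scans every code point, so it takes a moment, and the result is far larger than what any single language uses (the Latin list contains letters German never needs).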

Some languages are regularly written using more than one script, either because scripts are interchangeable or because the written language consists of more than one script. (Does that even include Arabic numerals in regular Latin script…?) Is "ï" an "English" character…? You wouldn't usually say so, yet "naïve" can be written using it in regular English. So… this is all rather vaguely defined… — deceze, Jun 29 '20 at 12:00
Even Germans themselves can't agree whether their alphabet consists of 26 letters and 4 "special letters" or 30… https://en.wikipedia.org/wiki/German_orthography#Special_letters What problem exactly are you trying to solve with this? — deceze, Jun 29 '20 at 12:10
Okay, my use case is not specifically about German; it's more generic. For instance, Hindi uses the `Devanagari` script, and its characters are completely fixed (unlike the German case). What I want to do is generate random strings for any given language. For example, if I specify the language code for `Simplified Chinese`, I need all of that language's characters so that I can generate random strings from them. It need not cover all the languages of the world; even just those covered by the Unicode data that ships with Python would do. – Gokul NC, Jun 29 '20 at 12:37
Something like https://mimesis.readthedocs.io or https://faker.readthedocs.io…? — deceze, Jun 29 '20 at 12:41
Yes, maybe something like that! But it seems many languages are not yet supported, mainly the ones my app targets. And I don't even need a library like that; just the list of characters for any given language code would be more than enough. – Gokul NC, Jun 29 '20 at 13:03

score 1 · Answer 1 · answered Jun 29 '20 at 15:26


The Unicode CLDR project compiles the information you're looking for (and more besides). For example, in the CLDR data for 'de' German (link is to data in the latest release), see the first row in the data. Python makes use of some CLDR data (e.g., for regex patterns by character properties), but probably not this particular data. Look for a library that provides support for CLDR exemplars by language/locale.
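
As a minimal sketch of how such exemplar data could be used once you have it: CLDR stores exemplars as a bracketed set of characters, and a simple set can be expanded with a few lines of Python and then sampled for random strings. The names `parse_exemplars`, `random_string`, and `DE_EXEMPLARS` are hypothetical, and the German string below is only an illustrative value in that style, so take the real one from the CLDR data. The parser deliberately handles only plain items, ranges like a-z, and braced multi-character items like {ch}.

```python
import random

# Illustrative value in the style of a CLDR exemplar set for German.
# Assumption: copy the real string from the CLDR data before relying on it.
DE_EXEMPLARS = "[a ä b c d e f g h i j k l m n o ö p q r s ß t u ü v w x y z]"


def parse_exemplars(text):
    """Expand a simple exemplar set into a list of strings.

    Supports whitespace-separated items, ranges such as a-z, and
    multi-character items in braces such as {ch}. Backslash escapes
    and nested sets are not handled in this sketch.
    """
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    items = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1
        elif ch == "{":
            # Braced item: keep the whole sequence as one entry.
            end = body.index("}", i)
            items.append(body[i + 1:end])
            i = end + 1
        elif i + 2 < len(body) and body[i + 1] == "-":
            # Range: expand every code point from start to stop inclusive.
            start, stop = ord(ch), ord(body[i + 2])
            items.extend(chr(cp) for cp in range(start, stop + 1))
            i += 3
        else:
            items.append(ch)
            i += 1
    return items


def random_string(alphabet, length, rng=random):
    """Join `length` randomly chosen items from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


german = parse_exemplars(DE_EXEMPLARS)
sample = random_string(german, 8)
```

Passing a seeded `random.Random` instance as `rng` makes the generated strings reproducible, which may be handy in tests.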


Peter Constable


Thanks! *`Look for a library that provides support for CLDR exemplars by language/locale.`* - Please let me know if any. I didn't find any. – Gokul NC Jun 30 '20 at 05:46
