
How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()?

Motivating example: Printing a table of strings to the console. Some of the strings contain non-ASCII characters.

>>> for title in d.keys():
...     print("{:<20} | {}".format(title, d[title]))

    zootehni-           | zooteh.
    zootekni-           | zootek.
    zoothèque          | zooth.
    zooveterinar-       | zoovet.
    zoovetinstitut-     | zoovetinst.
    母                   | 母母

>>> import unicodedata
>>> s = 'e\u0300'   # 'è' written as 'e' + combining grave accent
>>> len(s)
    2
>>> [ord(c) for c in s]
    [101, 768]
>>> unicodedata.name(s[1])
    'COMBINING GRAVE ACCENT'
>>> s2 = '母'
>>> len(s2)
    1

As can be seen, str.format() simply takes the number of code-points in the string (len(s)) as its width, leading to skewed columns in the output. Searching through the unicodedata module, I have not found anything suggesting a solution.

Unicode normalization can fix the problem for è, but not for Asian characters, which often have a larger display width. Similarly, zero-width Unicode characters exist (e.g. the zero-width space, which allows line breaks within words). You can't work around these issues with normalization, so please do not suggest "normalize your strings".
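
A minimal sketch of why normalization alone is not enough, using only the standard `unicodedata` module. The sample strings and variable names are chosen for illustration:

```python
import unicodedata

# 'e' + COMBINING GRAVE ACCENT: NFC composes this into a single code point.
decomposed = 'e\u0300'
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))        # 2 1

# A CJK ideograph is one code point, but is classified as wide ('W').
wide = '\u6bcd'
print(len(unicodedata.normalize('NFC', wide)),
      unicodedata.east_asian_width(wide))    # 1 W

# ZERO WIDTH SPACE survives normalization and still counts in len().
zero_width = 'a\u200bb'
print(len(unicodedata.normalize('NFC', zero_width)))   # 3
```

In the last two cases `len()` after normalization still does not match the number of columns a terminal would typically use.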

Edit: Added info about normalization.

Edit 2: My original dataset also has some European combining characters that don't result in a single code-point even after normalization:

    zwemwater     | zwemw.
    zwia̢z-       | zw.

>>> s3 = 'a\u0322'   # The 'a + combining retroflex hook below' from zwiaz
>>> len(unicodedata.normalize('NFC', s3))
    2
Christian Aichinger
  • No. Normalization does not fix this problem as a whole. It fixes it for combining characters in European languages. However, Asian characters often have larger display widths, again breaking str.format() and posing the question "What is the display width of the string". – Christian Aichinger Mar 06 '14 at 13:10
  • And [`unicodedata.east_asian_width`](http://docs.python.org/2/library/unicodedata.html#unicodedata.east_asian_width) doesn't help with that part? – Martijn Pieters Mar 06 '14 at 13:14
  • Note that it still depends on your console font if East Asian Width characters are actually displayed with a narrow or wide glyph and string formatting cannot help you there. – Martijn Pieters Mar 06 '14 at 13:17
  • Updated the answer. Please show me how to roll a solution for my problem out of ``unicodedata.east_asian_width()``. AFAICS it's not possible. E.g. ``s2 = unicodedata.normalize('NFC', s)`` gives "LATIN SMALL LETTER E WITH GRAVE" as desired. Then calling ``unicodedata.east_asian_width(s2)`` returns ``"A"``, which the documentation helpfully tells us is "ambiguous", although its display width is certainly 1. – Christian Aichinger Mar 06 '14 at 13:22
  • Note that I updated my question. I grant you that normalization fixes a part of the problem. But the whole purpose of Unicode is for code not to break once you throw exotic characters at it. I do not want a half-baked solution ("it's good enough for you"), so I don't believe this question should be closed at this point. Examples in other languages: C's ``wcswidth()``. – Christian Aichinger Mar 06 '14 at 13:26
  • For your `s2` as posted here I get `W`, wide. Unfortunately, *displaying* all of Unicode is *not* consistent. It won't break, but string lengths are not the only factor at play here; your console font also matters. There are several problems in your post now, and it is rapidly becoming Too Broad because of that. – Martijn Pieters Mar 06 '14 at 13:29
  • Normalization doesn't fix the problem for my data set. I just found [kitchen.text.display](http://pythonhosted.org//kitchen/api-text-display.html), which is Python2-only, but seems to do exactly what I want. – Christian Aichinger Mar 06 '14 at 13:41
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/49150/discussion-between-christian-aichinger-and-martijn-pieters) – Christian Aichinger Mar 06 '14 at 13:43
  • Normalization and display width are different topics. Those who marked this question as duplicate are totally mistaken. – Walter Tross Nov 23 '19 at 18:50
  • Totally not a duplicate. I hope people on this site don't simply close questions they cannot answer! – tsh Dec 05 '20 at 09:11
  • To anyone who got here: this question is actually what you want https://stackoverflow.com/questions/29776299/aligning-japanese-characters-in-python – tsh Dec 05 '20 at 09:15
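
The `unicodedata.east_asian_width` suggestion from the comments can be combined with character categories into a rough width estimate. This is a minimal sketch, not a full replacement for C's `wcswidth()`. `display_width` and `pad` are hypothetical helper names, ambiguous-width (`'A'`) characters are assumed to be narrow, and the result still depends on the console font, as noted in the comments:

```python
import unicodedata

def display_width(text):
    """Estimate how many terminal cells `text` occupies (hypothetical helper)."""
    width = 0
    for ch in text:
        # Non-spacing marks (Mn), enclosing marks (Me) and format characters
        # (Cf, e.g. ZERO WIDTH SPACE) are assumed to take no cell of their own.
        if unicodedata.category(ch) in ('Mn', 'Me', 'Cf'):
            continue
        # Wide ('W') and fullwidth ('F') characters usually take two cells.
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2
        else:
            # Assumption: everything else, including ambiguous 'A', is narrow.
            width += 1
    return width

def pad(text, target):
    """Left-align `text` in `target` cells by appending spaces (hypothetical helper)."""
    return text + ' ' * max(0, target - display_width(text))

rows = [('zootehni-', 'zooteh.'),
        ('zoothe\u0300que', 'zooth.'),
        ('zwia\u0322z-', 'zw.'),
        ('\u6bcd', '\u6bcd\u6bcd')]
for title, abbrev in rows:
    # Pad manually instead of relying on "{:<20}", which counts code points.
    print("{} | {}".format(pad(title, 20), abbrev))
```

Padding by hand sidesteps `str.format()`'s width handling, which only sees `len()`. Whether the columns actually line up on screen depends on how the terminal renders wide and combining characters.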

1 Answer


You have several options:

  1. Some consoles support escape sequences for pixel-exact positioning of the cursor. Might cause some overprinting, though.

    Historical note: This approach was used in the Amiga terminal to display images in a console window by printing a line of text and then advancing the cursor down by one pixel. The leftover pixels of the text line slowly built an image.

  2. Create a table in your code which contains the real (pixel) widths of all Unicode characters in the font that is used in the console / terminal window. Use a UI framework and a small Python script to generate this table.

    Then add code which calculates the real width of the text using this table. The result might not be a multiple of the character width in the console, though. Together with pixel-exact cursor movement, this might solve your issue.

    Note: You'll have to add special handling for ligatures (fi, fl) and composites. Alternatively, you can load a UI framework without opening a window and use the graphics primitives to calculate the string widths.

  3. Use the tab character (\t) to indent. But that will only help if your shell actually uses the real text width to place the cursor. Many terminals will simply count characters.

  4. Create an HTML file with a table and look at it in a browser.
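
Option 4 could look roughly like the following. This is a minimal sketch using only the standard `html` module; `rows_to_html` is a hypothetical helper and the sample rows are taken from the question:

```python
import html

def rows_to_html(rows):
    """Render (title, abbreviation) pairs as a minimal HTML table (hypothetical helper)."""
    lines = ['<!DOCTYPE html>', '<meta charset="utf-8">', '<table>']
    for title, abbrev in rows:
        # Escape the text so characters like '<' or '&' cannot break the markup.
        lines.append('  <tr><td>{}</td><td>{}</td></tr>'.format(
            html.escape(title), html.escape(abbrev)))
    lines.append('</table>')
    return '\n'.join(lines)

rows = [('zootehni-', 'zooteh.'),
        ('zoothe\u0300que', 'zooth.'),
        ('\u6bcd', '\u6bcd\u6bcd')]
page = rows_to_html(rows)
print(page)
```

Writing `page` to a file opened with `encoding='utf-8'` and loading that file in a browser leaves column alignment to the browser's layout engine, which measures the rendered glyphs rather than counting code points.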

Aaron Digulla
  • That trick on the Amiga was very lateral thinking. Breaking the printable characters up into patterns of just their very top row of pixels could cover close enough to pretty much everything you would need. Overprinting sounds fairly wasteful but clever if there's no other graphics option. What happens when you get to the bottom of the screen, I would imagine you'd have to leave a bottom row of whitespace, or could you print partially off screen? – Davos Sep 29 '17 at 13:28
  • @Davos You need to print a line of spaces. Printing past the bottom doesn't work because the terminal window would start scrolling. And the trick only works with the default font, obviously. – Aaron Digulla Sep 29 '17 at 15:34