Removing Extended Ascii with retention of text

Question

How can I convert a string containing extended ASCII characters that represent a number raised to a power, while retaining the exponent value? For example, if the string is 'm\xb3/h', which is intended to represent cubic meters per hour, I would like to return the string 'm3/h'. Likewise, 'm\xb2' should return 'm2'.

The code -

varUnit = 'm\xb3/h'
varUnit = varUnit.decode('ascii', 'ignore').encode('ascii')
print varUnit

returns 'm/h' while 'm3/h' is desired.

This is virtually always the wrong thing to do. The ASCII world is done. Welcome to Unicode! — tchrist, Jan 30 '11 at 22:56
@tchrist: Unfortunately, there are still a lot of legacy systems that won't accept anything beyond ASCII (or anything beyond whatever code page they prefer). It's not ideal, but there are plenty of situations where it's what you need to do. — Thomas K, Jan 30 '11 at 23:03
@thomas since this is Python the legacy argument does not apply — David Heffernan, Jan 30 '11 at 23:27
@Thomas: What’s a “code page”? Don’t you mean an encoding? — tchrist, Jan 30 '11 at 23:37
@David: Of course it does. Just because you're writing code in Python doesn't mean your data doesn't need to be used with a legacy system. In fact, Python is probably one of the best ways of preparing it if you do need to do that. — Thomas K, Jan 31 '11 at 13:03
@tchrist: It seems they mean pretty much the same thing. I was thinking of 8-bit, ascii-compatible encodings, like the iso-8859 family, which could all be called "extended ascii". — Thomas K, Jan 31 '11 at 13:10
@Thomas: “Extended ASCII” has no well-defined meaning. “ASCII compatible”, however, does. Both MacRoman and UTF-8 are ASCII compatible encodings, as are all the ISO 8859 encodings, Base32 and Base64, and UTF-7. There are [many other ASCII compatible encodings](http://www.inter-locale.com/IUC22.pdf) beyond these. Note that ASCII compatible is not the same as those with a one-to-one mapping between their code point values and ASCII’s. UTF-8 and MacRoman preserve this, whereas MIME-64 and UTF-7 do not. Perhaps *“backwards compatible with ASCII”* might work as a somewhat cumbersome phrase. — tchrist, Jan 31 '11 at 14:53
See also [this answer](http://stackoverflow.com/questions/4846365/find-characters-that-are-similar-glyphically-in-unicode/4846508#4846508). — tchrist, Jan 31 '11 at 16:01
@tchrist: '“Extended ASCII” has no well-defined meaning': That's essentially what I said! — Thomas K, Jan 31 '11 at 18:23

Thomas K · Answer 1 · 2011-01-30T23:05:53.820

Well, the first thing to know is that there is no single "extended ASCII". ASCII has been extended in many different ways. A quick test suggests that you want "latin_1" or "cp1252". So, first, convert it to unicode (a way of storing any character at all):

varUnit = varUnit.decode("latin_1")
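
A quick way to check the codec claim above, sketched for Python 3 (an assumption; the snippets in this thread are Python 2, where a plain string literal is already a byte string). The variable names are illustrative only:

```python
# Raw bytes as they might arrive from the legacy source (Python 3 bytes literal).
raw = b'm\xb3/h'

# Byte 0xB3 maps to SUPERSCRIPT THREE (U+00B3) in both codecs,
# so either one decodes this particular input the same way.
as_latin1 = raw.decode('latin_1')
as_cp1252 = raw.decode('cp1252')

print(as_latin1 == as_cp1252)  # the two decodings agree for this input
print(ascii(as_latin1))        # shows the decoded text with non-ASCII escaped
```

The two codecs differ only in the 0x80-0x9F range, so the choice matters for characters such as curly quotes but not for the superscript digits discussed here.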

EDIT: If you just want to display it in your own application, you should stop here and use Unicode. print varUnit should give you m³/h.

Legacy systems might not be able to handle that, though. In that case, you need to simplify the text to characters that can be represented in pure ASCII. The easiest way is to use the unidecode module (you can install it using pip or easy_install):

from unidecode import unidecode
print unidecode(varUnit)

score 1 · Answer 2 · answered Jan 30 '11 at 23:08

The superscript digits have compatibility decompositions, so you can do:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'm²')
'm2'
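
As a rough sketch that combines the two answers (it is not code from either of them), the bytes can be decoded first and then normalized. This assumes Python 3 and that the input really is Latin-1; `to_ascii_units` is a hypothetical helper name:

```python
import unicodedata

def to_ascii_units(raw, encoding='latin_1'):
    """Hypothetical helper: decode legacy bytes and flatten superscripts to digits."""
    # Step 1: bytes -> Unicode text, using the assumed source encoding.
    text = raw.decode(encoding)
    # Step 2: NFKC applies compatibility decompositions, e.g. U+00B3 -> '3'.
    flattened = unicodedata.normalize('NFKC', text)
    # Step 3: drop anything that still has no ASCII form instead of raising.
    return flattened.encode('ascii', 'ignore').decode('ascii')

print(to_ascii_units(b'm\xb3/h'))  # m3/h
print(to_ascii_units(b'm\xb2'))    # m2
```

Note that NFKC leaves accented letters composed, so the final step would silently drop them. If base letters should survive, 'NFKD' may be the better normalization form.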

dan04
