
I'm trying to encode a string that has characters not supported by the target encoding (CP 1047).

Is there a standard/common/easy way of mapping those characters to their closest CP 1047 equivalents?

For example, the text has a fancy (curly) double quote character (“ or ”) and I want to convert it to the straight double quote (").

Obviously I could just do the replacement in my own code, but is there a better way? Is there an open source tool or API out there that I don't know about?

Jonathan Leffler
tom
    The vast majority of Unicode characters don't _have_ CP1047 equivalents. – SLaks Aug 16 '11 at 18:56
  • 1
    This question http://stackoverflow.com/questions/4808967/replacing-unicode-punctuation-with-ascii-approximations points to this web page with a decent looking conversion table http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html – clstrfsck Aug 17 '11 at 00:40
  • 2
    It is far easier, and infinitely preferable, to upgrade your legacy encodings to Unicode than to downgrade Unicode into the musty old rotting boxes of legacy encodings. – tchrist Aug 17 '11 at 02:46

1 Answer


If you want to encode Unicode characters in EBCDIC (CP 1047), then (apparently) there's UTF-EBCDIC (though I don't know of any existing tools that can convert to that).

Alternatively, I would look into the non-standard form of percent-encoding (the %uXXXX form) or XML/HTML character escaping. Existing tools probably cover either one; Commons Lang's StringEscapeUtils, for example, handles the XML/HTML escaping.
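To make the XML/HTML option concrete, here is a minimal sketch that leaves encodable characters alone and writes everything else as a decimal numeric character reference. The class and method names are hypothetical, not part of any library. On a JVM that ships the extended charsets, `Charset.forName("IBM1047")` should select CP 1047, but that availability is an assumption, so the demo uses US-ASCII as a stand-in target.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class NumericReferenceEscaper {

    // Replaces every code point the target charset cannot encode with &#NNNN;.
    // A literal '&' is written as &amp; so the result can be unescaped unambiguously.
    static String escapeUnencodable(String input, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            String piece = new String(Character.toChars(codePoint));
            if (codePoint == '&') {
                out.append("&amp;");
            } else if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            // Step over both halves of a surrogate pair when there is one.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // US-ASCII stands in for CP 1047 here (assumption: swap in the real charset if available).
        System.out.println(escapeUnencodable("\u201CR&D\u201D", StandardCharsets.US_ASCII));
        // prints: &#8220;R&amp;D&#8221;
    }
}
```

The receiving side has to unescape the references again, so this only helps when both ends agree on the convention.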

Finally, if you just want to 'map' extended characters into the CP 1047 space, then I guess you're left with scanning the source string character by character and building the result string from a Map<Character, Character> (or a Map<Character, String>, if some replacements need more than one character). This only works if you know beforehand all the extended characters you have to deal with and their desired equivalents/replacements.
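The map-based approach could be sketched roughly as follows. The class name, method name, and the replacement table are illustrative assumptions; the table is nowhere near complete and would need to be filled in with whatever characters actually occur in your data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class LegacyCharMapper {

    // Replacement table (assumed sample entries); extend it to match your input.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u201C', "\"");  // left double quotation mark
        REPLACEMENTS.put('\u201D', "\"");  // right double quotation mark
        REPLACEMENTS.put('\u2018', "'");   // left single quotation mark
        REPLACEMENTS.put('\u2019', "'");   // right single quotation mark
        REPLACEMENTS.put('\u2013', "-");   // en dash
        REPLACEMENTS.put('\u2026', "..."); // horizontal ellipsis
    }

    // Scans the input one char at a time and substitutes any mapped character;
    // unmapped characters are copied through unchanged.
    static String mapToLegacy(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mapToLegacy("\u201CHello\u201D\u2026"));
        // prints: "Hello"...
    }
}
```

A Map<Character, String> is used rather than Map<Character, Character> so that one character can expand to several, as with the ellipsis. Characters outside the table pass through untouched, so the later encoding step still has to decide what to do with anything that remains unencodable.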

Alistair A. Israel