Java- Converting from unicode to ANSI

Question

I have a string \u0986\u09AE\u09BF \u0995\u09BF\u0982\u09AC\u09A6\u09A8\u09CD\u09A4\u09BF\u09B0 \u0995\u09A5\u09BE \u09AC\u09B2\u099B\u09BF. I need to convert it to Avwg wKsewš—i K_v ejwQ, which is in ANSI format. How can I convert this Unicode string to ANSI characters in Java?

Edit:

resultView.setTypeface(typeFace);
String str = "\u0986\u09AE\u09BF \u0995\u09BF\u0982\u09AC\u09A6\u09A8\u09CD\u09A4\u09BF\u09B0 \u0995\u09A5\u09BE \u09AC\u09B2\u099B\u09BF";
resultView.setText(str);

but what is resultView ? what is typeFace ? is it java in Android? — Marek Sebera, Oct 30 '11 at 07:57

score 6 · Answer 1 · answered Oct 30 '11 at 13:38

I need to convert it in AvwgwKsewš—i K_v ejwQ which is in ANSI format.

That's not ANSI format. The (misleadingly-named) "ANSI" code pages in Windows are all based around ASCII, with different characters added in the high bytes. Byte 0x41 (A) as a leading letter in an ANSI code page always means Latin A and not Bengali আ.
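To make the point about the low bytes concrete, here is a minimal sketch (the class and method names are made up for illustration). It decodes single bytes under a few charsets. windows-1251 and windows-1252 are not guaranteed to be installed on every JVM or Android device, so the code checks for them first.

```java
import java.nio.charset.Charset;

class AnsiLowByteDemo {
    // Decode a single byte under the named charset.
    static String decodeByte(int value, String charsetName) {
        return new String(new byte[] { (byte) value }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] names = { "US-ASCII", "ISO-8859-1", "windows-1252", "windows-1251" };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available here");
                continue;
            }
            // 0x41 decodes to Latin 'A' in all of these; only high bytes
            // such as 0xC0 differ from one code page to the next.
            System.out.println(name + ": 0x41 -> " + decodeByte(0x41, name)
                    + ", 0xC0 -> " + decodeByte(0xC0, name));
        }
    }
}
```

No charset in that family will turn 0x41 into a Bengali letter; only a font that draws the wrong glyph for "A" can do that.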

What I think you have is a custom symbol font, that maps arbitrary symbols to completely unrelated codepoints. Every such font has its own visual encoding; to convert between Unicode and the custom visual encoding you'd have to build up your own translation table by looking at the glyphs for each character and matching them to the Unicode character that represents the same letter.
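A translation table of that kind might look roughly like the sketch below. Everything in it is an assumption: the class name is hypothetical, and the handful of table entries are guessed by lining up the Unicode string in the question against the expected output, not taken from any font's documentation. It handles only plain letters and the i-kar (U+09BF), which such fonts draw to the left of the consonant; conjuncts and the other reordering vowel signs are not covered.

```java
import java.util.HashMap;
import java.util.Map;

class VisualEncodingSketch {
    // Hypothetical, incomplete table inferred from the sample in the question.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0986', "Av"); // BENGALI LETTER AA
        TABLE.put('\u09AE', "g");  // BENGALI LETTER MA
        TABLE.put('\u09BF', "w");  // BENGALI VOWEL SIGN I
        TABLE.put('\u0995', "K");  // BENGALI LETTER KA
        TABLE.put('\u09A5', "_");  // BENGALI LETTER THA
        TABLE.put('\u09BE', "v");  // BENGALI VOWEL SIGN AA
        TABLE.put('\u09AC', "e");  // BENGALI LETTER BA
        TABLE.put('\u09B2', "j");  // BENGALI LETTER LA
        TABLE.put('\u099B', "Q");  // BENGALI LETTER CHA
        TABLE.put(' ', " ");
    }

    static String toVisual(String text) {
        StringBuilder out = new StringBuilder();
        int lastStart = 0; // where the previous character's output begins
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String mapped = TABLE.get(c);
            if (mapped == null) {
                throw new IllegalArgumentException(
                        String.format("No mapping for U+%04X", (int) c));
            }
            if (c == '\u09BF') {
                // Logical order is consonant then i-kar; visual order is reversed.
                out.insert(lastStart, mapped);
            } else {
                lastStart = out.length();
                out.append(mapped);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First, third and fourth words of the question's string, minus the conjunct.
        String input = "\u0986\u09AE\u09BF \u0995\u09A5\u09BE \u09AC\u09B2\u099B\u09BF";
        System.out.println(toVisual(input)); // prints: Avwg K_v ejwQ
    }
}
```

Even this toy version needs reordering logic on top of the lookup, which gives a feel for how much work a complete converter for one specific font would be.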

I would strongly advise getting a proper Unicode-aware font that supports Bengali instead. Content stuck in an arbitrary font-specific encoding is difficult to deal with, because semantically you really are dealing with a string that means "AvwgwKsewš—i K_v ejwQ", with all the editing and case-changing gotchas that implies.
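As a small illustration of the case-changing gotcha (a sketch only; the class name is invented for the example), compare what an ordinary upper-casing call does to the font-encoded text and to real Bengali text:

```java
import java.util.Locale;

class CaseGotchaDemo {
    static String shout(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String visual = "Avwg";                // font-specific encoding from the question
        String unicode = "\u0986\u09AE\u09BF"; // the same word as real Bengali text

        // To Java the visual string is just Latin letters, so they get upper-cased
        // and the symbol font would then draw different glyphs.
        System.out.println(shout(visual));                  // prints: AVWG

        // Bengali has no letter case, so the Unicode string is left untouched.
        System.out.println(shout(unicode).equals(unicode)); // prints: true
    }
}
```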

Visual-encoded fonts are an unhappy relic of the time before Windows had good Unicode (or even ISCII) support. They should not be used for anything today.

bobince

Great answer. But I question whether Windows really has good Unicode support. People have a lot of trouble with using UTF-8 on the console, with transparently using code points from all planes, with generating UTF-8 streams without spurious BOMs, etc etc etc. – tchrist Oct 31 '11 at 01:06
@tchrist: certainly UTF-8 is a second-class citizen under Windows, with the default C stdlib handling of “code page 65001” being seriously broken (which is one of the reasons the Command Prompt can't do Unicode very well, but then the Command Prompt is a horrible relic). The rest of it isn't bad; in most of the front-end stuff non-BMP characters typically work fine. – bobince Oct 31 '11 at 13:07
It's a shame that Windows got bitten by the Curse Of UTF-16, and especially that it keeps the locale-specific “ANSI” code page as the default for narrow string handling even to this day (instead of UTF-8 which every other modern OS uses). But you can see how it happened given the history, with NT being designed before the invention of UTF-8. Back then the Unicode guys really did think that everyone was going to be moving to handling character IO with two-byte code units at all times. Microsoft were trying to be modern, and Windows got trapped with a standard that wasn't quite ready. – bobince Oct 31 '11 at 13:09

laher · Answer 2 · 2011-10-30T10:59:11.367

I'm not sure exactly what you're asking, but I'll assume you're asking how to convert some characters from Unicode into an 8-bit character set (e.g. ISO-8859-1, the character set for 'Western European' languages such as English).

I don't know of any way to automatically detect the relevant 8-bit charset, so I looked up one of your characters (on here http://unicode.org/charts/ ), and I can see that these characters are Bengali.

I think the equivalent 8-bit character set for Bengali is known as x-iscii-be. I don't have this installed on my system, so I couldn't do the conversion successfully.

EDIT: Java does not support the charset x-iscii-be, but I'll leave the remainder of this answer for illustration purposes. See http://download.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html for a list of supported Charsets.

EDIT2: Android certainly doesn't guarantee support for this charset (the only 8-bit character set it guarantees is ISO-8859-1). See: http://developer.android.com/reference/java/nio/charset/Charset.html .

So, I think you should run some Charset-detecting code on a Bengali Android device - perhaps it supports this charset. Everything you need is in my code sample.
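For a quick look at what a particular device or JVM offers, something like the following could be run on it. This is a sketch with made-up class and method names; it lists every installed charset whose canonical name or alias contains a given fragment, using only Charset.availableCharsets() and Charset.aliases().

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CharsetFinder {
    // Return canonical names of installed charsets whose name or alias contains the fragment.
    static List<String> findCharsets(String fragment) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            boolean hit = cs.name().toLowerCase(Locale.ROOT).contains(needle);
            for (String alias : cs.aliases()) {
                hit |= alias.toLowerCase(Locale.ROOT).contains(needle);
            }
            if (hit) {
                matches.add(cs.name());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // An empty list means nothing ISCII-like is installed on this system.
        System.out.println(findCharsets("iscii"));
    }
}
```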

In order for Java to convert your data in a different charset, all you need to do in Java is to check that the desired Charset is installed, and then specify the desired Charset when you convert the String into bytes.

The conversion itself would be extremely simple:

    str.getBytes("x-iscii-be");

So, you see, the String itself is stored in a single internal form (Java uses UTF-16 internally, regardless of the default charset), and you can treat getBytes(charsetName) as a kind of 'alternative output format' for the String. Sorry - poor explanation!
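A small sketch of that 'alternative output format' idea, using only charsets that every Java platform is required to provide (the class and helper names are invented for the example):

```java
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Render bytes as two hex digits each, keeping leading zeros.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aCircumflex = "\u00c2";
        // Same String, three different byte representations.
        System.out.println(hex(aCircumflex.getBytes(StandardCharsets.ISO_8859_1))); // c2
        System.out.println(hex(aCircumflex.getBytes(StandardCharsets.UTF_8)));      // c382
        System.out.println(hex(aCircumflex.getBytes(StandardCharsets.UTF_16BE)));   // 00c2

        // A Bengali letter has no ISO-8859-1 equivalent, so getBytes
        // silently substitutes '?' (0x3f) instead of failing.
        System.out.println(hex("\u0986".getBytes(StandardCharsets.ISO_8859_1)));    // 3f
    }
}
```

The last line is worth keeping in mind: encoding into a charset that cannot represent the text does not throw, it quietly loses data.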

In your situation, perhaps you just need to assign a Charset to the resultView, and the framework will work its magic for you ...

Here's some test code I put together to illustrate the point, and to check whether a given charset is supported on a system.

I have got this code to output the byte-arrays as 'hex' strings, so that you can see that the data is different after conversion.

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class UnicodeTest {
    public static void main(String[] args) throws UnsupportedEncodingException {
        testWestern();
        testBengali();
    }

    public static void testWestern() throws UnsupportedEncodingException {
        String unicodeStr = "\u00c2"; // capital A with a circumflex accent
        String charsetName = "ISO-8859-1";
        System.out.println("Input (printed using the default charset): " + unicodeStr);
        attempt8bitCharsetConversion(unicodeStr, charsetName);
    }

    public static void testBengali() throws UnsupportedEncodingException {
        String unicodeStr = "\u0986\u09AE\u09BF \u0995\u09BF\u0982\u09AC\u09A6\u09A8\u09CD\u09A4\u09BF\u09B0 \u0995\u09A5\u09BE \u09AC\u09B2\u099B\u09BF";
        String charsetName = "x-iscii-be";
        System.out.println(unicodeStr);
        attempt8bitCharsetConversion(unicodeStr, charsetName);
    }

    public static void attempt8bitCharsetConversion(String input, String charsetName) throws UnsupportedEncodingException {
        // Only throw when the charset is missing. The original loop fell through
        // to the throw even after a successful conversion.
        if (!Charset.isSupported(charsetName)) {
            throw new UnsupportedEncodingException(charsetName + " is not supported on this system");
        }
        System.out.println("HEXED input : " + toHex(input.getBytes(Charset.defaultCharset())));
        System.out.println("HEXED output: " + toHex(input.getBytes(Charset.forName(charsetName))));
    }

    public static String toHex(byte[] input) {
        // Format byte by byte: BigInteger would treat the array as a signed number
        // and drop leading zero bytes.
        StringBuilder hex = new StringBuilder();
        for (byte b : input) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}

See also here for more information on charset conversion: http://download.oracle.com/javase/tutorial/i18n/text/string.html

Character sets are a tricky business, so please forgive my convoluted answer.

HTH

Ah. I assumed that your Bengali OS would support this character set. But, on further reading, (Oracle) Java just doesn't support it. See here for the list of supported character encodings: http://download.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html — laher, Oct 30 '11 at 10:42
I just checked the android docs, and Android itself only guarantees availability of unicode + ascii + ISO-8859-1. Perhaps some phones made for Bengali markets do support it? I suggest you try to run some code on the target phone, to check if it supports the charset — laher, Oct 30 '11 at 10:57
OK. If it doesn't support the charset, you're going to need to find a typeface that supports Unicode instead. Even if it does, you should look into it anyway. Good luck — laher, Oct 30 '11 at 11:49
Yeah, actually I first tried with Unicode, but the Bangla language has some conjunct characters. I tried this to overcome that problem. — Abdullah Md. Zubair, Oct 30 '11 at 13:05

score 0 · Answer 3 · answered Feb 09 '12 at 08:04

I've written a class that solves the problem of 09CB ো, 09CC ৌ, 09C7 ে, 09C8 ৈ, 09BF ি, ্য, ্র and ৃ in UTF-8. I reshape the text by editing the font glyphs, so you don't need to change it to extended ASCII. :( But I still couldn't solve your Bengali conjugates. For proper rendering it requires android 3.5 or higher; it works smoothly on Android 4.0 (Ice Cream Sandwich).
