I have a simple web service that lists a variable number of foreign languages.

Some of the names are written in their native script (Chinese, for example).

I must read these names from a web page and add them dynamically to a JComboBox.

Currently I'm reading them this way:

public static Vector getSiteLanguages() {
    System.out.println("Reading Home from " + Constants.HOME);
    URL url;
    URLConnection connection;
    BufferedReader br;
    String inputLine;

    String regEx = "<option.*value=.([A-Z]*).>(.*)</option>";
    Pattern pattern = Pattern.compile(regEx);       
    Matcher m;
    Vector siteLangs = new Vector(); 

    try {
        url = new URL( Constants.HOME);
        connection = url.openConnection();
        br = new BufferedReader(new InputStreamReader(connection.getInputStream()));

        while ((inputLine = br.readLine()) != null) {
            m = pattern.matcher(inputLine);
            while ( m.find()) {
                System.out.println(m.group(1) + "->" + m.group(2) );
                siteLangs.add(m.group(2));
            }
        }
        br.close();
    } catch (IOException e) {
        return siteLangs;
    } 

    return siteLangs;       
}

Then in the JFrame class I'm doing this:

Vector siteLangs = Language.getSiteLanguages();
JComboBox siteLangCombo = new JComboBox(siteLangs);

But this way all the non-Latin language names are lost.

How do I preserve the non-Latin text in this situation?

BalusC
realtebo

1 Answer

By default, InputStreamReader uses the platform default character encoding to convert bytes to characters. The website is apparently using a different character encoding to convert characters to bytes in the HTTP response. Check the charset parameter of the HTTP Content-Type response header to find out which one it is.

String contentType = connection.getHeaderField("Content-Type");
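The header value usually looks like `text/html; charset=UTF-8`, so the charset still has to be picked out of it. Here is a minimal sketch of how that could be done. The class name `ContentTypeCharset` and the helper `charsetOf` are made up for this illustration, and falling back to a default when the parameter is missing is an assumption, not something the server guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value such as "text/html; charset=UTF-8".
    // Returns the fallback when the header is absent, carries no charset
    // parameter, or names a charset this JVM does not support.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

The result can be passed straight to the `InputStreamReader` constructor, which also accepts a `Charset`.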

Assuming that it's UTF-8, which is these days the most commonly used character encoding on websites that strive for world domination, here's how you should specify it when constructing the InputStreamReader in your code:

br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
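To see why the explicit charset matters, here is a small self-contained sketch that decodes the same bytes twice. A byte array stands in for the HTTP response body, and the sample `<option>` line, the class name `DecodeDemo`, and the helper `firstLine` are assumptions made for the demo. It uses `StandardCharsets.UTF_8` (Java 7+) rather than the string `"UTF-8"`, which avoids the checked `UnsupportedEncodingException`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Reads the first line of the stream, decoding its bytes with the given charset.
    static String firstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample line; the two escapes are the Chinese characters for "Chinese".
        String original = "<option value=\"ZH\">\u4e2d\u6587</option>";
        // Stands in for the bytes a UTF-8 web server would send.
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        String right = firstLine(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        String wrong = firstLine(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 matches:      " + original.equals(right));
        System.out.println("decoded as ISO-8859-1 matches: " + original.equals(wrong));
    }
}
```

Only the decoding that uses the encoding the bytes were written in round-trips to the original string.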

Unrelated to the concrete problem, Vector is a legacy class which has been superseded by the List interface and its ArrayList implementation since Java 1.2 (1998). Are you sure that you're reading up-to-date resources during your Java learning spree? Further, regex should not be your first choice when you just need to parse HTML. This is Java, not PHP. Use a normal HTML parser. You may find Jsoup helpful for this. All the code you have so far can then be reduced to two or three lines.
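As for dropping Vector: a JComboBox does not need one, because it can also be built from an array or a ComboBoxModel. Here is a minimal sketch, assuming the language names have already been collected in a `List<String>`. The class name `ComboModelDemo`, the helper `modelOf`, and the sample entries are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.swing.DefaultComboBoxModel;

public class ComboModelDemo {

    // Copies the list into a combo box model; the list itself can grow to
    // any size beforehand, so the number of languages need not be known.
    static DefaultComboBoxModel<String> modelOf(List<String> items) {
        return new DefaultComboBoxModel<>(items.toArray(new String[0]));
    }

    public static void main(String[] args) {
        List<String> siteLangs = new ArrayList<>();
        siteLangs.add("English");
        siteLangs.add("\u4e2d\u6587"); // sample non-Latin entry

        DefaultComboBoxModel<String> model = modelOf(siteLangs);
        System.out.println(model.getSize() + " entries in the model");

        // In the JFrame class this model would be handed to the combo box:
        // JComboBox<String> siteLangCombo = new JComboBox<>(model);
    }
}
```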

BalusC
  • After adding 'UTF-8' I can display about 90% of the listed languages. The original charset is probably different from UTF-8. I'll investigate the header. – realtebo May 19 '12 at 19:36
  • Note that whatever you use to display or print the information must also be using UTF-8. – BalusC May 19 '12 at 19:37
  • Thanks for the info about Vector. I used it only because the API docs say that a JComboBox can be initialized from a Vector, and I cannot know the exact number of elements in advance because it changes almost every day. I'll study this further. – realtebo May 19 '12 at 19:38
  • "the output wherever you're displaying/printing the information is also using UTF-8" --- and what about the languages that still aren't displayed? How do I handle those? – realtebo May 19 '12 at 19:39
  • (Originally posted in Italian.) This is how the source code (over which I have no control) that I read these language tags from looks: http://img217.imageshack.us/img217/1844/cattura1k.jpg This is how the combo box looks on Windows 7: http://img838.imageshack.us/img838/2760/cattura2ug.jpg If you need more info, just tell me. – realtebo May 19 '12 at 19:44
  • OK, then you should indeed read it as UTF-8. As to displaying, Java/Swing already uses Unicode by default, so that part is fine. If you're still seeing boxes for some specific languages, then your next problem is that you need to specify a font which supports those characters. For example, Arial. But that's a different problem. Your problem of reading the proper data has been solved. By the way, I do not understand Italian. Please write in English. – BalusC May 19 '12 at 19:49
  • Sorry... the first image was the source code, the second image shows how the combo looks now. I tried the Arial font, but then 9 language names are shown as boxes instead of only 2. I'm looking at the source font file ... – realtebo May 19 '12 at 19:56
  • If you're seeing a box instead of [mojibake](http://en.wikipedia.org/wiki/Mojibake), then it means that the font in use doesn't have a glyph (an image representing the character) for that character. You need to specify a different font or supply a font which contains glyphs for those characters. If you were seeing mojibake, then the character encoding would indeed have been wrong (but that is not the case here). – BalusC May 19 '12 at 19:58
  • Firebug tells me that the computed font-family is Arial, but Arial lacks some of the needed characters... I don't understand... – realtebo May 19 '12 at 20:00
  • No, your problem is on the Swing side. The browser font is totally irrelevant; it isn't used at all when transferring the bytes and decoding the characters. Your browser font is only used when you use this browser to view the page. Your concrete problem is that your Swing application must use a font which supports those characters. Feel free to ask a new question focused on displaying specific characters in Swing. Do NOT include the code for retrieving the webpage. Just copy-paste the string and hardcode it in Java. That makes problem solving easier. – BalusC May 19 '12 at 20:02
  • I don't understand: if the browser uses Arial, doesn't my PC have a complete set of glyphs in Arial? Why can't Swing use Arial like the browser does? I'll probably just exclude these 2 languages to get on with the project. – realtebo May 19 '12 at 20:06
  • I also don't understand, but I have also never really used Swing. I am a Java EE web developer. I was just answering how to retrieve the characters the right way. That part is solved now. – BalusC May 19 '12 at 20:07
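The missing-glyph diagnosis from the comments can be checked in code: `java.awt.Font` has a `canDisplay(int codePoint)` method that reports whether the font has a glyph for a character. The following is only a sketch. The class name `GlyphCheck` and the helper `missing` are invented here, "Arial" and the sample string are assumptions, and if the named font is not installed Java silently substitutes a default font, so the result applies to whatever font was actually loaded.

```java
import java.awt.Font;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;

public class GlyphCheck {

    // Returns the code points of the text for which canDisplay reports false.
    // Taking a predicate keeps the logic independent of any particular font.
    static List<Integer> missing(String text, IntPredicate canDisplay) {
        return text.codePoints()
                   .filter(cp -> !canDisplay.test(cp))
                   .boxed()
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumption: the combo box uses Arial, as tried in the comments.
        Font font = new Font("Arial", Font.PLAIN, 12);
        String sample = "\u4e2d\u6587"; // hardcoded sample string

        for (int cp : missing(sample, font::canDisplay)) {
            System.out.printf("U+%04X has no glyph in %s%n", cp, font.getFontName());
        }
    }
}
```

If characters are reported as missing, one option is to set a different font on the combo box with `setFont`. A browser typically falls back to other installed fonts for characters its primary font lacks, which would explain why the page itself renders fine.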