
We read an important parameter as a VM argument; it is a path to a file. Users are now passing paths that contain Korean characters (the folders have Korean names), and the program has started to break because the Korean characters are read as question marks. The experiment below shows the situation.

I debugged the program in Eclipse. In "Debug Configurations", on the "Arguments" tab under "VM arguments", I entered:

-Dfilepath=D:\XXXX\카운터

But when I read it from the program like this

String filepath = System.getProperty("filepath");

I get the output with question marks like below.

D:\XXXX\???

I understand that the Eclipse debug GUI uses the right encoding (?) to display the characters, but when the value is read in the program a different encoding is used, one that cannot represent them.

What default encoding does Java use to read the VM arguments supplied to it?

How can I change the encoding in Eclipse so that the program reads the characters properly?

  • How do you know the property's value is `D:\XXXX\???`? How are you examining/displaying it? – VGR Sep 15 '15 at 15:01
  • (1) I printed it to the console and it showed question marks. (2) I could see the value of the variable in the Eclipse Debug perspective. (3) I confirmed it by manually typing the string "D:\XXXX\???" and comparing it for equality with what I was getting in the _filepath_ variable. – Senthilkumar Annadurai Sep 15 '15 at 15:21
  • I have the same issue when I try to run in Windows (from Cygwin; cmd.exe won't allow multibyte characters). Not surprisingly, it works just fine in Linux. I believe http://stackoverflow.com/questions/7660651/passing-command-line-unicode-argument-to-java-code addresses this, but doesn't really provide an answer, just some explanation. As a workaround, I would consider URL-encoding the path, and having the code pass it through [URLDecoder](http://docs.oracle.com/javase/8/docs/api/java/net/URLDecoder.html). – VGR Sep 15 '15 at 15:55

1 Answer


My conclusion is that the conversion depends on the default encoding (the Windows setting "Language for non-Unicode programs"). Here is the program I used for testing:

package test;

import java.io.FileOutputStream;

public class Test {
    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        // The bracketed literal is the reference value; compare it with what the JVM received.
        sb.append("[카운터] sysprop=[").append(System.getProperty("cenv"));
        if (args.length > 0) {
            sb.append("], cmd args=[").append(args[0]);
        }
        sb.append("], file.encoding=").append(System.getProperty("file.encoding"));

        // Write the result to a file as UTF-8 rather than to System.out.
        FileOutputStream fout = new FileOutputStream("/testout");
        try {
            fout.write(sb.toString().getBytes("UTF-8"));
        } finally {
            fout.close(); // close the stream even if the write fails
        }
        // Thread.sleep(10000); // uncomment to check the arguments using Process Explorer
    }
}

Test1: "Language for non-Unicode programs" is Korean(Korea)

Execute in the command prompt: java -Dcenv=카운터 test.Test 카운터 (the Korean characters are correct when I verify the arguments using Process Explorer)

Result: [카운터] sysprop=[카운터], cmd args=[카운터], file.encoding=MS949

Test2: "Language for non-Unicode programs" is Chinese(Traditional, Taiwan)

Execute in the command prompt (pasted from the clipboard): java -Dcenv=카운터 test.Test 카운터 (I cannot see the Korean characters in the command window; however, they are correct when I verify the arguments using Process Explorer)

Result: [카운터] sysprop=[???], cmd args=[???], file.encoding=MS950

Test3: "Language for non-Unicode programs" is Chinese(Traditional, Taiwan)

Launch from Eclipse, setting both Program arguments and VM arguments. (The command line shown in Process Explorer is C:\pg\jdk160\bin\javaw.exe -agentlib:jdwp=transport=dt_socket,suspend=y,address=localhost:50672 -Dcenv=카운터 -Dfile.encoding=UTF-8 -classpath S:\ws\wtest\bin test.Test 카운터, which is the same as what the Properties dialog of the Eclipse Debug view shows.)

Result: [카운터] sysprop=[???], cmd args=[bin], file.encoding=UTF-8

Change the Korean characters to "碁石", which exists in both the MS950 and MS949 charsets:

  • Test1 Result: [碁石] sysprop=[碁石], cmd args=[碁石], file.encoding=MS949
  • Test2 Result: [碁石] sysprop=[碁石], cmd args=[碁石], file.encoding=MS950
  • Test3 Result: [碁石] sysprop=[碁石], cmd args=[碁石], file.encoding=UTF-8

Change the Korean characters to "鈥焢", which exists in the MS950 charset:

  • Test1 Result: [鈥焢] sysprop=[??], cmd args=[??], file.encoding=MS949
  • Test2 Result: [鈥焢] sysprop=[鈥焢], cmd args=[鈥焢], file.encoding=MS950
  • Test3 Result: [鈥焢] sysprop=[鈥焢], cmd args=[鈥焢], file.encoding=UTF-8

Change the Korean characters to "宽广", which exists in the GBK charset:

  • Test1 Result: [宽广] sysprop=[??], cmd args=[??], file.encoding=MS949
  • Test2 Result: [宽广] sysprop=[??], cmd args=[??], file.encoding=MS950
  • Test3 Result: [宽广] sysprop=[??], cmd args=[??], file.encoding=UTF-8
  • Test4: to verify my assumption, I changed "Language for non-Unicode programs" to Chinese(Simplified, PRC) and executed java -Dcenv=宽广 test.Test 宽广 in the command prompt

    Result: [宽广] sysprop=[宽广], cmd args=[宽广], file.encoding=GBK

During testing, I always checked the command line in Process Explorer and made sure all characters were correct. However, the characters of the command-line arguments are converted using the default encoding before main(String[] args) of the Java class is invoked. If any character does not exist in the charset of the default encoding, the program receives an unexpected argument.
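The lossy step can be reproduced in isolation. The following is a minimal sketch, not the launcher's actual code: it pushes a string through a charset and back, which yields the same question marks whenever the charset lacks a character. ISO-8859-1 stands in for the Windows code page here only because every JVM ships it; the class name `LossyRoundTrip` and the helper `roundTrip` are made up for illustration. The `sun.jnu.encoding` property is undocumented, so treat its presence and meaning as an assumption; on common JVMs it tends to report the platform encoding, which would also fit Test3, where setting `file.encoding=UTF-8` changed nothing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyRoundTrip {
    /** Encodes text into the charset and decodes it back; unmappable characters are replaced. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The three Hangul characters from the question, written as Unicode escapes
        // so that the encoding of this source file cannot interfere.
        String korean = "\uCE74\uC6B4\uD130";

        // ISO-8859-1 has no Hangul, so each character is replaced by '?'.
        System.out.println(roundTrip(korean, StandardCharsets.ISO_8859_1)); // prints ???

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(korean, StandardCharsets.UTF_8).equals(korean)); // prints true

        // Encodings the JVM reports; sun.jnu.encoding is undocumented and may be null.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
    }
}
```

Running this on the affected machine and comparing the two reported properties with the "Language for non-Unicode programs" setting is one way to check which encoding is in play.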

I'm not sure whether the problem is caused by java.exe/javaw.exe or by Windows, but passing non-ASCII parameters via command-line arguments is not a good idea.
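One way to keep the command line pure ASCII is the workaround VGR suggested in the comments on the question: URL-encode the path before passing it, and decode it in the program. The sketch below assumes Java 10 or later (for the `Charset` overloads of `URLEncoder.encode` and `URLDecoder.decode`) and reuses the `filepath` property name from the question; the class and helper names are hypothetical.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class EncodedPathProperty {
    /** Produces an ASCII-only form of the path that is safe to pass as -Dfilepath=... */
    static String encode(String path) {
        return URLEncoder.encode(path, StandardCharsets.UTF_8);
    }

    /** Restores the original path from the value read via System.getProperty. */
    static String decode(String propertyValue) {
        return URLDecoder.decode(propertyValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Example path from the question; the Hangul part is written as Unicode escapes.
        String original = "D:\\XXXX\\" + "\uCE74\uC6B4\uD130";

        // This is the string a launcher script or the Eclipse VM arguments would carry.
        String encoded = encode(original);
        System.out.println("-Dfilepath=" + encoded);

        // In the real program the value comes from the property; the encoded
        // example is used as a fallback so this sketch runs without any -D flag.
        String raw = System.getProperty("filepath", encoded);
        String filepath = decode(raw);
        System.out.println(filepath.equals(original));
    }
}
```

Because the encoded value contains only ASCII characters, it should survive whatever code page conversion happens between the shell, the launcher, and `main`. The cost is that users have to supply the encoded form, so this suits generated launch configurations better than hand-typed ones.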

BTW, I also tried executing the command from a .bat file (file encoding is UTF-8), in case anyone is interested.

Test5: "Language for non-Unicode programs" is Korean(Korea)

The command line in Process Explorer is java -Dcenv=移댁슫?? test.Test 移댁슫?? (the Korean characters are garbled)

Result: [카운터] sysprop=[移댁슫??], cmd args=[移댁슫??], file.encoding=MS949

Test6: "Language for non-Unicode programs" is Korean(Korea)

Add another VM argument. The command line in Process Explorer is java -Dfile.encoding=UTF-8 -Dcenv=移댁슫?? test.Test 移댁슫?? (the Korean characters are garbled)

Result: [카운터] sysprop=[移댁슫??], cmd args=[移댁슫??], file.encoding=UTF-8

Test7: "Language for non-Unicode programs" is Chinese(Traditional, Taiwan)

The command line in Process Explorer is java -cp s:\ws\wtest\bin -Dcenv=儦渥?? test.Test 儦渥?? (the Korean characters are garbled)

Result: [카운터] sysprop=[儦渥??], cmd args=[儦渥??], file.encoding=MS950

Beck Yang