I'm having a hard time figuring out how to handle this problem:

I'm developing a web tool for an Italian university, and I have to display words with accents (such as è, ù, ...). Sometimes I get these words from a PostgreSQL table (UTF-8 encoded), but mostly I have to read long passages from files. These files are UTF-8 encoded XML and display fine in Smultron or any UTF-8 editor (they were created by using Python to parse old files that contained entities such as &egrave; instead of "è").

I wrote a Java class that extracts the relevant segments from the XML file. It is used like this:

String s = parseText(filename, position);

If I write the returned String to a file, everything looks fine. The problem is that if I do

out.write(s);

in the JSP page, I get strange characters. By the way, I use

String s = getWordFromPostgresql(...);

out.write(s);

in the very same JSP, and it displays OK.

Any hint?

Thanks Nicola


@krosenvold

Thanks for your response. That directive is already in the page, but it doesn't help (actually it "works", but only for the strings I get from the database). I think the problem has something to do with reading from the files, but I can't figure it out: the strings work in plain Java but not in the JSP (I can't think of a better explanation).

Here's a basic example extracted from the actual code. The method that reads from the files returns a Map from a Mark (an object representing a position in the text) to a String (containing the text).

This is in the .jsp page (with the UTF-8 directive cited in the answers above):

    // ...
    Map<Mark, String> map = TestoMarkParser.parseMarks(...);
    out.write(map.get(m));

and this is the result:

"Fu per√≤ cos√¨ in uso il Genere Enharmonico, che quelli quali vi si esercitavano,"

if I put the same code in a java class, and substitute out.write with System.out.println, the result is this:

"Fu però così in uso il Genere Enharmonico, che quelli quali vi si esercitavano,"


I did some analysis with a hex editor; here are the results:

original string: "fu però così "

ò in xml file: C3 B2

ò as rendered by out.write() in the jsp file: E2 88 9A E2 89 A4

ò as written to file via:

FileWriter w = new FileWriter(new File("out.txt"));
w.write(s);     // s is the parsed string
w.close();

C3 B2

Printing the value of each character in the string as an int:

0: 70 = F
1: 117 = u
2: 32 =  
3: 112 = p
4: 101 = e
5: 114 = r
6: 8730 = √
7: 8804 = ≤
8: 32 =  
9: 99 = c
10: 111 = o
11: 115 = s
12: 8730 = √
13: 168 = ¨
14: 10 = 
nicolamontecchio
  • This is great question for UTF-8 and Java http://stackoverflow.com/questions/138948/how-to-get-utf-8-working-in-java-webapps – Sergio del Amo Jun 24 '09 at 15:45

4 Answers

In the JSP page directive, try setting the content type's charset to UTF-8; when no pageEncoding is given, this sets the pageEncoding to UTF-8 as well.

<%@page contentType="text/html;charset=UTF-8"%>

UTF-8 is not the default charset in JSP, and all sorts of interesting problems arise from this. By default, the underlying stream is treated as an ISO-8859-1 stream, so any text you write to it is encoded and interpreted as ISO-8859-1. I find that setting the encoding to UTF-8 is the best solution.
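To make the mismatch concrete, here is a minimal sketch in plain Java, with no servlet container involved. The class and method names are made up for illustration, and the `OutputStreamWriter` only stands in for the response writer. It shows which bytes leave the writer under each charset, and what a client sees if it decodes UTF-8 bytes as ISO-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ResponseEncodingSketch {

    // Writes text through a Writer bound to the given charset and returns the raw bytes.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "per\u00f2"; // "però", escaped so the source file encoding does not matter

        byte[] latin1 = encode(word, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(word, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 bytes: " + hex(latin1)); // 70 65 72 F2
        System.out.println("UTF-8 bytes:      " + hex(utf8));   // 70 65 72 C3 B2

        // A client that decodes the UTF-8 bytes as ISO-8859-1 gets two characters for "ò".
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("chars after misreading: " + misread.length()); // 5 instead of 4
    }
}
```

The writer's charset and the charset declared to the client have to agree; the contentType directive is what keeps the two in sync in a JSP.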

Edit: Furthermore, a String variable in Java should always hold Unicode characters. So you should always be able to say

System.out.println(myString) 

and see the proper characters in the console window of your web server (or just stop in the debugger and examine the string). I suspect that you'll see incorrect characters when you do this, which leads me to believe you have an encoding problem when constructing the string.

krosenvold

I have some international JSPs (they contain characters that are "special" with respect to English).

Inserting this at the top of them, and only this (no contentType directive as well, since that caused a duplicate contentType error), got them to save and render correctly:

<%@page pageEncoding="UTF-8"%>

This reference [http://www.inter-locale.com/codeset1.jsp] helped me discover that.

cellepo
  • +1; removing the duplicate contentType in my included JSP fixed my issue. I think it's a bit weird that a duplicate page directive causes this incorrect behaviour, though. – SND Feb 15 '18 at 13:23

I had the same problem: everything was set to "utf-8", yet I still saw senseless characters. The problem was in the JSP, and the fix was to put this at the head of the page:

 <%request.setCharacterEncoding("utf-8");%>

After that, everything was OK.

misman
String s = parseText(filename, position);

Where is this method defined? I'm guessing that it's your own method, which opens the file and extracts a particular chunk of the data. Somewhere in this process it's getting converted from bytes to characters, probably using the default encoding for your JVM.

If the default encoding of your running JVM doesn't match the actual encoding in the file, you're going to get incorrect characters in your string. Added to that, if you're reading content that is encoded in a multi-byte form (such as UTF-8), your "position" may point into the middle of a multi-byte encoding.
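As a rough illustration of both failure modes, the sketch below decodes the same UTF-8 bytes with the right charset, with a wrong one, and from an offset that lands inside a multi-byte sequence. The class name, the `readAll` helper, and the in-memory byte array standing in for the file are assumptions for the example, not the asker's code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetRead {

    // Decodes the whole stream with the charset passed in, never the JVM default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "fu però così" as UTF-8 bytes, standing in for the contents of the file.
        String original = "fu per\u00f2 cos\u00ec";
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.ISO_8859_1);
        System.out.println("matching charset round-trips: " + right.equals(original));
        System.out.println("wrong charset length: " + wrong.length()
                + " (expected " + original.length() + ")");

        // Byte 7 is the second byte (B2) of "ò"; starting a decode there cannot yield a valid character.
        String midSequence = new String(fileBytes, 7, 1, StandardCharsets.UTF_8);
        System.out.println("mid-sequence decode is U+FFFD: " + midSequence.equals("\uFFFD"));
    }
}
```

Passing the charset explicitly to `InputStreamReader` removes the dependency on the JVM default, which can differ between a desktop run and the web server.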

If the source files are in well-formed XML, you'll be much better off using a real parser (such as the one built into the JDK) to parse them, since the parser will provide the correct translation of bytes to characters. Then use an XPath expression to retrieve the values.
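A minimal sketch of that approach, using only the JDK's built-in DOM parser and XPath support. The element names (`text`, `seg`), the `id` attribute, and the class and method names are hypothetical, since the real file layout isn't shown. The point is that the parser receives raw bytes and decodes them according to the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

final class XPathExtractSketch {

    // Parses XML from raw bytes, so the parser (not the JVM default charset) decides how to decode them,
    // then returns the string value of the first node matched by the XPath expression.
    static String extract(byte[] xmlBytes, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document layout; the bytes stand in for the UTF-8 file on disk.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<text><seg id=\"1\">Fu per\u00f2 cos\u00ec in uso</seg></text>";
        byte[] xmlBytes = xml.getBytes(StandardCharsets.UTF_8);

        String segment = extract(xmlBytes, "/text/seg[@id='1']");

        // Print code points rather than glyphs, so the console encoding cannot hide a problem.
        for (int i = 0; i < segment.length(); i++) {
            System.out.println(i + ": " + (int) segment.charAt(i));
        }
    }
}
```

With a real file, a `FileInputStream` (or the `File` overload of `DocumentBuilder.parse`) can replace the byte array; either way no `Reader` with a guessed charset sits between the file and the parser.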

If you haven't used an XML parser in the past, here are two documents that I wrote on parsing and XPath.


Edit: one thing that you may find helpful is to print out the actual character values in the string, using something like the following:

public static void main(String[] argv) throws Exception
{
    String s = "testing\u20ac";
    for (int ii = 0 ; ii < s.length() ; ii++)
    {
        System.out.println(ii + ": " + (int)s.charAt(ii) + " = " + s.charAt(ii));
    }
}

You should probably also print out your default character set, so that you know how any particular sequence of bytes is translated to characters:

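// requires: import java.nio.charset.Charset;
public static void main(String[] argv) throws Exception
{
    // Prints the charset the JVM uses whenever no encoding is given explicitly
    System.out.println(Charset.defaultCharset());
}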

And finally, you should examine the served page as raw bytes, to see exactly what's being returned to the client.
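If no hex editor or network tool is at hand, a small helper like the following can do the byte-level inspection. It is only a sketch: the class and method names are invented, and the stream here comes from an in-memory array, whereas in practice it could come from `java.net.URL#openStream()` on the page's address.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class HexDump {

    // Reads the stream byte by byte and returns space-separated upper-case hex pairs.
    static String dump(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02X", b));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the served page: the single character "ò" encoded as UTF-8.
        byte[] served = "\u00f2".getBytes(StandardCharsets.UTF_8);
        System.out.println(dump(new ByteArrayInputStream(served))); // C3 B2
    }
}
```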


Edit #2: the character ò is Unicode value 00F2, which is encoded in UTF-8 as C3 B2. These two bytes don't correspond to the symbols that you showed in your earlier answer.
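One way to connect the numbers from the question is sketched below. Encoding the two characters reported in the dump (8730 and 8804) as UTF-8 gives exactly the six bytes seen in the JSP output, which suggests the string already held two wrong characters before it reached the page. The second half is a guess rather than a diagnosis: it assumes the bytes C3 B2 were decoded with a Mac charset, and it only runs if the JVM ships the `x-MacRoman` charset. The class name is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DoubleEncodingCheck {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters the question reports in place of "ò": code points 8730 and 8804.
        String seen = new String(new char[] {(char) 8730, (char) 8804});
        System.out.println(hex(seen.getBytes(StandardCharsets.UTF_8))); // E2 88 9A E2 89 A4

        // The correct character and its UTF-8 bytes, as found in the XML file.
        byte[] utf8 = "\u00f2".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(utf8)); // C3 B2

        // Assumption: the file bytes were decoded with a Mac default charset somewhere along the way.
        if (Charset.isSupported("x-MacRoman")) {
            String misread = new String(utf8, Charset.forName("x-MacRoman"));
            for (int i = 0; i < misread.length(); i++) {
                System.out.println(i + ": " + (int) misread.charAt(i));
            }
        } else {
            System.out.println("x-MacRoman is not available in this JVM");
        }
    }
}
```

If the code points printed in the last loop match those in the question's dump, the fault lies in how the bytes were turned into characters, not in the JSP output stage.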

For more on Unicode characters, see the code charts at Unicode.org.

kdgregory
  • parseText was an example method that I defined myself. Yes, I use SAX parsing, although I admit I'm new to XML handling. What I really can't understand is the difference between printing to a Java console and printing to a JSP page (the very same Java String object is rendered differently...). I'm looking at your documents now, thanks for the reference. – nicolamontecchio Jan 28 '09 at 18:51
  • I examined the XML file with a hex editor, and I found that the ò character is indeed encoded in the XML as C3 B2 ... – nicolamontecchio Jan 28 '09 at 20:29
  • I think I found out what was wrong; probably there is some conversion error when using the characters() method in the SAX parser. In fact the accented characters get encoded 'twice' (i.e. the utf-8 encoding of the utf-8 encoding). I switched to a simpler DOM parser (which handles by itself all these details) and the page works fine (thanks for your tutorial). – nicolamontecchio Jan 30 '09 at 14:20