
I'm reading the messages from an email account using JavaMail 1.4.1 (I've since upgraded to version 1.4.5, with the same result), but I'm having issues with the encoding of the content:

POP3Message pop3message;
... 
Object contentObject = pop3message.getContent();
...   
String contentType = pop3message.getContentType();
String content = contentObject.toString();

Some messages are read properly, but others contain strange characters because they are decoded with an unsuitable encoding. I have realized that it fails only for one specific content type.

It works well if the contentType is any of these:

  • text/plain; charset=ISO-8859-1

  • text/plain;
    charset="iso-8859-1"

  • text/plain;
    charset="ISO-8859-1";
    format="flowed"

  • text/plain; charset=windows-1252

but it doesn't if it is:

  • text/plain;
    charset="utf-8"

For this contentType (the UTF-8 one), if I try to get the encoding with pop3message.getEncoding(), I get

quoted-printable

For this content type, the String value in the debugger looks like this, for example (and I see the same text in the database after persisting the object):

UbicaciÃ³n (instead of Ubicación)

But if I open the email with the email client in a browser it can be read without any problem, and it's a normal message (no attachments, just text), so the message seems to be OK.

Any idea about how to solve this issue?

Thanks.


UPDATE: This is the piece of code I've added to try the function getUTF8Content() suggested by jlordo:

POP3Message pop3message = (POP3Message) message;
String uid = pop3folder.getUID(message);

//START JUST FOR TESTING PURPOSES
if(uid.trim().equals("1401")){
    Object utfContent = pop3message.getContent();
    System.out.println(utfContent.getClass().getName()); // it is of type String
    //System.out.println(utfContent); // if not commented out, it prints the content of one of the emails I'm having problems with.
    System.out.println(pop3message.getEncoding()); //prints: quoted-printable
    System.out.println(pop3message.getContentType()); //prints: text/plain; charset="utf-8"
    String utfContentString = getUTF8Content(utfContent); // throws java.lang.ClassCastException: java.lang.String cannot be cast to javax.mail.util.SharedByteArrayInputStream
    System.out.println(utfContentString);
}

//END TEST CODE
Javi
  • Where exactly do you see `UbicaciÃ³n (instead of Ubicación)`? Console? Variable Inspector? I suspect everything is fine, but the debugger can't display utf-8 characters. – jlordo Nov 14 '12 at 19:02
  • @jlordo In the debugger of Eclipse I see that by watching what is inside the content variable. Also in the database, postgresql, if I do a select I get that result. – Javi Nov 14 '12 at 19:16
  • Do you read it from the db, or write it to the db and then read it out again? Is the db set up correctly? – jlordo Nov 14 '12 at 19:21
  • @jlordo I read the email, and then persist the data in the database. I'm persisting an entity with hibernate, but I'm detecting the encoding issue before, just after reading the message in the debugger. It's an application with over 2 years of development and it persists many entities without any problem of encoding, so it should be well configured. I find it really strange that it works with iso-8859-1 and windows-1252 but not with utf-8. – Javi Nov 14 '12 at 19:29
  • `so it should be well configured` is an assumption that's `false` most of the time ;) I would first try to exclude the db as the root cause by manually adding something with special characters and reading it afterwards. If the db is ok, I hope the error is in your code and you can fix it, and not in a library you use. – jlordo Nov 14 '12 at 19:37
  • @jlordo How can it be a problem of the database if I detect the problem even before the data is persisted? – Javi Nov 14 '12 at 19:39
  • how can you be sure the problem exists before you persist? Do you know for sure the Eclipse debugger can correctly display UTF-8? – jlordo Nov 14 '12 at 19:42
  • @jlordo Before persisting the data I watch it in the debugger, I save it to a log, and I even print it to the console, and all of them show it the same way (while with ISO-8859-1 and windows-1252 it is shown correctly). After persisting it in the database I can see exactly the same by using the admin of PostgreSQL. Do you really think Eclipse, the console, the logs and later the PostgreSQL admin are not able to print it correctly? I think it must be a problem regarding JavaMail. – Javi Nov 14 '12 at 19:54
  • [Here is the documentation](http://javamail.kenai.com/nonav/javadocs/javax/mail/internet/MimeMessage.html#getContent()) What type does `getContent();` return in your utf-8 case? – jlordo Nov 14 '12 at 20:12

4 Answers


How are you detecting that these messages have "strange characters"? Are you displaying the data somewhere? It's possible that whatever method you're using to display the data isn't handling Unicode characters properly.

The first step is to determine whether the problem is that you're getting the wrong characters, or that the correct characters are being displayed incorrectly. You can examine the Unicode values of each character in the data (e.g., in the String returned from the getContent method) to make sure each character has the correct Unicode value. If it does, the problem is with the method you're using to display the characters.
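One way to carry out that check is to print the code points of the String instead of the characters themselves, so that no display layer can interfere. This is only a minimal sketch; the class and method names (`CodePointDump`, `dump`) are made up for illustration, and the second sample string is an assumed example of what a wrongly decoded value could look like.

```java
public class CodePointDump {

    /** Returns every code point of s formatted as U+XXXX, separated by spaces. */
    public static String dump(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Correctly decoded text: the accented letter is the single code point U+00F3.
        System.out.println(dump("Ubicaci\u00f3n"));
        // Assumed example of wrongly decoded text: two code points, U+00C3 U+00B3.
        System.out.println(dump("Ubicaci\u00c3\u00b3n"));
    }
}
```

If the dump of the String returned by getContent() shows U+00F3 for the accented letter, the data is correct and the display is at fault. If it shows two code points such as U+00C3 U+00B3, the bytes were decoded with the wrong charset somewhere before that point.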

Bill Shannon
  • I watch it in the debugger of Eclipse, but I can see it also in the postgresql database. I don't think it is a problem of eclipse and PgAdminIII. Indeed when I read from that table I get again that encoding problem in this field. – Javi Nov 14 '12 at 19:19
  • Again, do as I suggested to determine where the problem is being introduced. – Bill Shannon Nov 14 '12 at 23:25

Try this and let me know if it works:

// contentType and contentObject are the variables from the question's code
if (contentType != null && contentType.toLowerCase().contains("utf-8")) {
    content = getUTF8Content(contentObject);
}

// Needs java.io.IOException, java.io.InputStreamReader, java.nio.charset.Charset
// and javax.mail.util.SharedByteArrayInputStream.
// TODO take care of IOException and ClassCastException
public static String getUTF8Content(Object contentObject) throws IOException {
    // possible ClassCastException if getContent() did not return this stream type
    SharedByteArrayInputStream sbais = (SharedByteArrayInputStream) contentObject;
    // "UTF-8" is always supported, so Charset.forName cannot fail here
    InputStreamReader isr = new InputStreamReader(sbais, Charset.forName("UTF-8"));
    int charsRead;
    StringBuilder content = new StringBuilder();
    char[] buffer = new char[1024];
    // possible IOException
    while ((charsRead = isr.read(buffer)) != -1) {
        content.append(buffer, 0, charsRead);
    }
    return content.toString();
}

BTW, is JavaMail 1.4.1 a requirement? Up to date version is 1.4.5.
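Because getContent() is declared to return Object, and JavaMail normally hands back a String for text parts, a defensive variant could check the runtime type before casting. The following is a minimal sketch using only the standard library; `ContentToString` and `contentToString` are hypothetical names, not part of JavaMail.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentToString {

    /**
     * Converts a content object to a String.
     * A String is returned unchanged (it has already been decoded);
     * an InputStream is decoded with the given charset.
     */
    public static String contentToString(Object content, Charset charset) throws IOException {
        if (content instanceof String) {
            return (String) content;
        }
        if (content instanceof InputStream) {
            Reader reader = new InputStreamReader((InputStream) content, charset);
            StringBuilder out = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
            return out.toString();
        }
        throw new IllegalArgumentException("Unsupported content object: "
                + (content == null ? "null" : content.getClass().getName()));
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "Ubicaci\u00f3n".getBytes(StandardCharsets.UTF_8);
        // Stream case: the bytes are decoded here, with an explicit charset.
        System.out.println(contentToString(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));
        // String case: nothing is decoded again.
        System.out.println(contentToString("Ubicaci\u00f3n", StandardCharsets.UTF_8));
    }
}
```

Note that in the String case the charset argument has no effect: if that String is already wrong, the decoding went wrong earlier and has to be investigated at the byte level.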

jlordo
  • The above is effectively what JavaMail does internally when returning a String for any part, using the charset in the message. – Bill Shannon Nov 14 '12 at 23:29
  • Are you saying this won't work for you? I came up with it without looking at the sources. This reads the bytes from the underlying byte array. If they are wrong in that array, then they are wrong, and you need to check how they got in there. – jlordo Nov 14 '12 at 23:34
  • I'm saying JavaMail already does the same thing so there's no need to do it in the application. And yes, as you say, if the wrong bytes are in the message something else is going wrong. It's possible, for example, that the program creating the message is putting iso-8859-1 bytes in the message, but setting the charset in the header to "utf-8". Spam programs are often broken like that. – Bill Shannon Nov 15 '12 at 04:18
  • In your post you write 'so the message seems to be OK.'. If the message contains the wrong bytes, how can it be displayed correctly? – jlordo Nov 15 '12 at 07:30
  • I have tried that piece of code for one of the emails I'm having problems with, and for it the type of the Object contentObject is java.lang.String. When I invoke getUTF8Content() with it, it throws a ClassCastException in the first line: java.lang.ClassCastException: java.lang.String cannot be cast to javax.mail.util.SharedByteArrayInputStream – Javi Nov 15 '12 at 08:58
  • That's why I have the `if` on top: first check the type with `message.getContentType();`, and if it's not utf8, don't call the method – jlordo Nov 15 '12 at 09:02
  • I'm invoking that method for just one message, and it is UTF-8. I control the uids of the messages, and I have placed that call inside an if which checks if(uid.trim().equals("1401")), where 1401 is the uid of one of the problematic emails. Inside that if I even print the encoding, content type and content to the console to be sure about that message: the content type is the expected text/plain; charset="utf-8", while getEncoding() returns "quoted-printable". – Javi Nov 15 '12 at 09:14
  • show me your call of the method, also, if you have an uid for your emails, where do your emails come from? – jlordo Nov 15 '12 at 09:16
  • @jlordo I've updated the question and added there the piece of code – Javi Nov 15 '12 at 09:24
  • I'm just reading the emails of the inbox of an account where I receive emails sent by other people. The uids are the ones given by JavaMail with the method getUID() on the POP3Folder object. – Javi Nov 15 '12 at 09:27
  • Well, if `getContentType()` states UTF-8, but the Bytes are incorrect UTF-8, the error may as well be on the sending side... Sorry – jlordo Nov 15 '12 at 11:01
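Whether the bytes of a message really are valid UTF-8 can be tested directly with a strict decoder. This is a minimal sketch under the assumption that the transfer-decoded bytes of the body have already been read into a byte array (for instance from the part's input stream); `Utf8Check` and `isValidUtf8` are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    /** Returns true only if the whole array is well-formed UTF-8. */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "Ubicaci\u00f3n".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "Ubicaci\u00f3n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // well-formed UTF-8
        System.out.println(isValidUtf8(latin1)); // a lone 0xF3 byte is not valid UTF-8
    }
}
```

A false result for a message labelled charset="utf-8" would point to the sending side. A true result means the bytes are fine and the wrong charset is being applied when they are turned into a String.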

What worked for me was calling getContentType() and checking whether the returned String contains "utf" (that is, whether the declared charset is one of the UTF family).

If yes, I would treat the content differently in this case.

private String encodeCorrectly(InputStream is) {
    java.util.Scanner s = new java.util.Scanner(is, StandardCharsets.UTF_8.toString()).useDelimiter("\\A");
    return s.hasNext() ? s.next() : "";
}

(a modification of an InputStream-to-String converter from this answer on SO)

The important part here is using the correct Charset. This solved the issue for me.
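Rather than testing for the substring "utf", the charset could be taken from the content type itself and used for decoding. The following is a minimal sketch with the standard library only; the regular expression is a simplification of real MIME parameter parsing, and `CharsetAwareReader`, `charsetOf` and `read` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetAwareReader {

    // Simplified pattern: matches charset=value with or without double quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Extracts the charset parameter from a content type, or returns the fallback. */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // illegal or unsupported charset name: use the fallback below
                }
            }
        }
        return fallback;
    }

    /** Reads the whole stream and decodes it with the given charset. */
    public static String read(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        String contentType = "text/plain;\n    charset=\"utf-8\"";
        Charset cs = charsetOf(contentType, StandardCharsets.ISO_8859_1);
        byte[] body = "Ubicaci\u00f3n".getBytes(StandardCharsets.UTF_8);
        System.out.println(cs + ": " + read(new ByteArrayInputStream(body), cs));
    }
}
```

The ISO-8859-1 fallback in the usage example is an arbitrary choice for the sketch; which default is appropriate depends on the mail being processed.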

Ev0oD

First of all, you must set the headers for UTF-8 encoding, like this:

...
MimeMessage msg = new MimeMessage(session);
msg.setHeader("Content-Type", "text/html; charset=UTF-8");
msg.setHeader("Content-Transfer-Encoding", "8bit");

msg.setFrom(new InternetAddress(doConversion(from)));
msg.setRecipients(javax.mail.Message.RecipientType.TO, address);
msg.setSubject(asunto, "UTF-8");

MimeBodyPart mbp1 = new MimeBodyPart();
mbp1.setContent(text, "text/html; charset=UTF-8");
Multipart mp = new MimeMultipart();
mp.addBodyPart(mbp1);
...

But for the 'from' header, I use the following method to convert characters:

public String doConversion(String original) {
    if(original == null) return null;
    String converted = original.replaceAll("á", "\u00c3\u00a1");
    converted = converted.replaceAll("Á", "\u00c3\u0081");
    converted = converted.replaceAll("é", "\u00c3\u00a9");
    converted = converted.replaceAll("É", "\u00c3\u0089");
    converted = converted.replaceAll("í", "\u00c3\u00ad");
    converted = converted.replaceAll("Í", "\u00c3\u008d");
    converted = converted.replaceAll("ó", "\u00c3\u00b3");
    converted = converted.replaceAll("Ó", "\u00c3\u0093");
    converted = converted.replaceAll("ú", "\u00c3\u00ba");
    converted = converted.replaceAll("Ú", "\u00c3\u009a");
    converted = converted.replaceAll("ñ", "\u00c3\u00b1");
    converted = converted.replaceAll("Ñ", "\u00c3\u0091");
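    // € is U+20AC, whose UTF-8 encoding is the three bytes E2 82 AC
    converted = converted.replaceAll("€", "\u00e2\u0082\u00ac");
    converted = converted.replaceAll("¿", "\u00c2\u00bf");
    converted = converted.replaceAll("ª", "\u00c2\u00aa");
    // º is U+00BA, whose UTF-8 encoding is C2 BA (C2 B0 would be the degree sign)
    converted = converted.replaceAll("º", "\u00c2\u00ba");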
    return converted;
}

If you need to include other characters, you can look up their UTF-8 hex encoding at http://www.fileformat.info/info/charset/UTF-8/list.htm.
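The replacement table appears to map each character to its UTF-8 bytes, with every byte then written as one ISO-8859-1 character. Under that assumption the same conversion can be sketched generically, without listing characters one by one. `Utf8AsLatin1` and `toUtf8AsLatin1` are hypothetical names, and the result is deliberately "double encoded" text, so this only makes sense as the same kind of workaround as the method above.

```java
import java.nio.charset.StandardCharsets;

public class Utf8AsLatin1 {

    /**
     * Encodes the text as UTF-8 and reinterprets every resulting byte
     * as a single ISO-8859-1 character.
     */
    public static String toUtf8AsLatin1(String original) {
        if (original == null) {
            return null;
        }
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00F3 (o with acute) becomes the two characters U+00C3 U+00B3.
        String converted = toUtf8AsLatin1("\u00f3");
        for (char c : converted.toCharArray()) {
            System.out.println(String.format("U+%04X", (int) c));
        }
    }
}
```

ASCII characters pass through unchanged, because their UTF-8 encoding is a single byte with the same value.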

jlbofh