
I receive an HTML file that I need to read and parse. The file can be in English, Japanese, or any other language, with whatever character encoding that language requires. The problem occurs when the file is in Japanese with any of these encodings:

  • Shift JIS
  • EUC-JP
  • ISO-2022-JP

I tried reading the file with FileReader, but the result is all garbage characters. I also tried FileInputStream with a hard-coded Japanese encoding to check whether a Japanese file is read correctly, but the result is not as expected.

FileInputStream fis = new FileInputStream(htmlFile);
InputStreamReader isr = new InputStreamReader(fis, " ISO-2022-JP");

I don’t have much experience with character encoding and internationalization. Any suggestions on how I can read and write files with different encodings?

One more thing: I don't know how to get the character encoding of the HTML file I am reading. I understand that I need to write the file in the same encoding, but I'm not sure how to get the original file's encoding. Thanks,

axtavt
alwaysLearning
  • Sure! Accept some answers to your past questions. – awm Mar 04 '11 at 14:39
  • Can you show some examples of input and result? – axtavt Mar 04 '11 at 14:50
  • Where exactly does this HTML file come from? From a website? What exactly do you want to do with this HTML file? Extract some data? – BalusC Mar 04 '11 at 14:56
  • Right, it comes from an actual site, which has different encodings. I read it, parse it and, using a third-party lib, create an image from it. The image comes out as complete gibberish. – alwaysLearning Mar 04 '11 at 15:06
  • Michael has already answered it. Check it in the HTTP headers. More detail about how exactly to approach this using Java code depends on the APIs/libraries you're using to get the HTML file from the website and to convert it to an image. – BalusC Mar 04 '11 at 15:19

1 Answer

  • Forget that FileReader exists. It implicitly uses the platform default encoding, which makes it pretty much useless.
  • Your code with the hardcoded encoding is correct except for the encoding name itself, which has a leading space. If you remove it, the code should correctly read ISO-2022-JP encoded files.
  • As for getting the character encoding of the HTML file, there are a number of ways it can be transmitted:
    • on the HTTP level in a Content-Type HTTP header - but this is only available when you read the file from the webserver, not when it's saved as a file
    • as a corresponding META HTML tag: <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
    • or, if the document type is XHTML, in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>
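
To illustrate the last two points, here is a minimal sketch (not a complete solution) that looks for a declared charset in the first few kilobytes of a saved file and then decodes the whole file with it. The class name HtmlCharsetSniffer, the methods sniff and readHtml, the regular expression, and the 4096-byte window are all assumptions made for this example. It relies on Shift JIS, EUC-JP and ISO-2022-JP keeping the ASCII markup of the declaration readable, so it will not work for UTF-16 files, and a regex is only an approximation of how a browser sniffs the encoding.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class HtmlCharsetSniffer {

    // Matches charset=... in a META tag or encoding="..." in an XML declaration.
    private static final Pattern DECLARED = Pattern.compile(
            "(?i)(?:charset|encoding)\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+-]*)");

    /** Returns the declared charset, or the fallback if none is found or it is not supported. */
    static Charset sniff(byte[] head, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding
        // never fails and leaves the ASCII parts of the markup intact.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Reads the whole file and decodes it with the sniffed (or fallback) charset. */
    static String readHtml(Path file, Charset fallback) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        byte[] head = Arrays.copyOf(bytes, Math.min(bytes.length, 4096));
        return new String(bytes, sniff(head, fallback));
    }
}
```

A call such as HtmlCharsetSniffer.readHtml(path, StandardCharsets.UTF_8) would then return the decoded text. When the page is fetched over HTTP, a charset in the Content-Type header should normally take precedence over anything declared inside the document.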
Michael Borgwardt