15

I have some working code that uses a crutch to add a BOM to a new file.

  #writing
  File.open name, 'w', 0644 do |file|
    file.write "\uFEFF"
    file.write @data
  end

  #reading
  File.open name, 'r:bom|utf-8' do |file|
    file.read
  end

Is there any way to add the marker automatically, without writing the cryptic "\uFEFF" before the data? Maybe something like File.open name, 'w:bom' (as it stands, this mode has no effect)?

ujifgc
  • 1
    Please do not use a BOM for UTF-8!!! It is ***neither required nor recommended*** by the Unicode Consortium. – tchrist Mar 28 '12 at 00:24
  • Thanks for your guidance, master. I came for the problem with a different approach and patched my template engine to respect Encoding.default_external. – ujifgc Mar 28 '12 at 08:10
  • 1
    "No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32." http://www.unicode.org/faq/utf_bom.html – Jan Mar 18 '13 at 09:38
  • 7
    unless you are in windows world and they make you apply a BOM to ascii files to be recognized as UTF-8 – jtruelove Oct 11 '13 at 19:38
  • 4
    @tchrist actually the BOM is recommended by the Unicode Consortium in some circumstances. See http://www.unicode.org/faq/utf_bom.html#bom10 The circumstances are: 1. conforming to certain protocols (e.g. Microsoft .txt files); 2. to specify encoding or endianness of text streams where that would otherwise be unclear, within protocols that allow it. – Dave Burt Feb 11 '16 at 22:18

2 Answers

11

**** This answer led to a new gem: file_with_bom ****

I had a similar problem in the past, so I extended File.open with additional encoding variants for write mode:

class File
  # Byte order marks (U+FEFF) as raw bytes, keyed by encoding.
  BOM_LIST_hex = {
    Encoding::UTF_8    => "\xEF\xBB\xBF".b,      # U+FEFF encoded as UTF-8
    Encoding::UTF_16BE => "\xFE\xFF".b,
    Encoding::UTF_16LE => "\xFF\xFE".b,
    Encoding::UTF_32BE => "\x00\x00\xFE\xFF".b,
    Encoding::UTF_32LE => "\xFF\xFE\x00\x00".b,
  }.freeze

  def utf_bom_hex(encoding = external_encoding)
    BOM_LIST_hex[encoding]
  end

  class << self
    alias :open_old :open
    # Note: the third argument must be an options hash, not an integer perm.
    def open(filename, mode_string = 'r', options = {}, &block)
      options = options.dup
      # check for the bom flag, either as :bom option or as -bom in mode_string
      use_bom = options.delete(:bom)
      use_bom = true if mode_string =~ /-bom/i
      mode_string = mode_string.sub(/-bom/i, '')   # do not mutate the caller's string

      f = open_old(filename, mode_string, **options)
      bom = f.utf_bom_hex   # nil if the encoding has no BOM or is unknown
      if use_bom && bom
        case mode_string
        # 'r:bom|utf-8' is already standard since 1.9.2
        when /\Ar/   # read mode -> skip a leading BOM
          # return to position 0 if the first bytes are not a BOM
          f.rewind if f.read(bom.bytesize) != bom
        when /\Aw/   # write mode -> write the BOM first
          f << bom.dup.force_encoding(f.external_encoding)
        end
      end

      return f unless block_given?
      begin
        yield f
      ensure
        f.close
      end
    end
  end
end #File

Test code:

EXAMPLE_TEXT = 'some content öäü'
File.open("file_utf16le.txt", "w:utf-16le-bom"){|f| f << EXAMPLE_TEXT }
File.open("file_utf16le.txt", "r:utf-16le-bom:utf-8"){|f| p f.read }
File.open("file_utf16le.txt", "r:utf-16le:utf-8", :bom => true ){|f| p f.read }
File.open("file_utf16le.txt", "r:utf-16le:utf-8"){|f| p f.read }   # BOM is not stripped

File.open("file_utf8.txt", "w:utf-8", :bom => true ){|f| f << EXAMPLE_TEXT }
File.open("file_utf8.txt", "r:utf-8", :bom => true ){|f| p f.read }
File.open("file_utf8.txt", "r:utf-8-bom"){|f| p f.read }
File.open("file_utf8.txt", "r:utf-8"){|f| p f.read }   # BOM is not stripped

Some remarks:

  • The code dates from pre-1.9 times (but it still works).
  • I used -bom as the BOM indicator (Ruby 1.9 uses a bom| prefix, as in r:bom|utf-8).

Some fixes still needed to make it better:

  • use Ruby's bom| notation instead of -bom
  • use the standard r:bom|utf-8 mode for reading
  • make it work on both Ruby 1.8 and 1.9
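
To illustrate the standard read mode mentioned in the list above, here is a minimal sketch comparing a plain UTF-8 read with Ruby's built-in BOM| prefix. The helper name read_plain_and_bom_aware and the file name are made up for this example; only File.open, File.binwrite and Dir.mktmpdir are real Ruby APIs.

```ruby
require 'tmpdir'

# Hypothetical helper for illustration: reads the same file twice,
# once as plain UTF-8 and once with Ruby's built-in BOM| prefix.
def read_plain_and_bom_aware(path)
  plain     = File.open(path, 'r:UTF-8')     { |f| f.read }
  bom_aware = File.open(path, 'r:BOM|UTF-8') { |f| f.read }
  [plain, bom_aware]
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'bom_demo.txt')
  # Write the UTF-8 BOM bytes followed by some ASCII text.
  File.binwrite(path, "\xEF\xBB\xBFhello".b)

  plain, bom_aware = read_plain_and_bom_aware(path)
  p plain.start_with?("\uFEFF")  # the BOM is still part of the string
  p bom_aware                    # the BOM has been skipped
end
```

Since reading is already covered by the standard mode, a patch like the one above would only really be needed for the write side.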

Perhaps I will find some time tomorrow to refactor my code and provide it as a gem.

knut
5

Alas, I think your manual approach is the way to go; at least I don't know a better way:

http://blog.grayproductions.net/articles/miscellaneous_m17n_details

To quote from JEG2's article:

Ruby 1.9 won't automatically add a BOM to your data, so you're going to need to take care of that if you want one. Luckily, it's not too tough. The basic idea is just to print the bytes needed at the beginning of a file.
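
One way to keep the "\uFEFF" out of the calling code is to wrap the manual approach in a small helper. The following is only a sketch: write_with_bom is a hypothetical name, not something provided by Ruby or by the quoted article, and it assumes an encoding with explicit byte order (such as UTF-8 or UTF-16LE), so that Ruby can transcode U+FEFF into the matching BOM bytes.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of Ruby or of the quoted article):
# writes U+FEFF first, then the data, letting Ruby transcode both
# into the external encoding of the file.
def write_with_bom(path, data, encoding = Encoding::UTF_8)
  File.open(path, "w:#{encoding.name}") do |file|
    file.write "\uFEFF"   # becomes the BOM bytes of the target encoding
    file.write data
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'with_bom.txt')
  write_with_bom(path, 'some content')

  # The raw file starts with the UTF-8 BOM bytes EF BB BF.
  p File.binread(path).start_with?("\xEF\xBB\xBF".b)
  # Reading with the standard BOM| prefix skips the marker again.
  p File.open(path, 'r:BOM|UTF-8') { |file| file.read }
end
```

Compared with patching File.open, a separate helper leaves the standard method untouched, at the cost of a different call at each write site.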

Michael Kohl
  • 2
    You do not want a BOM in a UTF-8 file!! – tchrist Mar 28 '12 at 00:24
  • 4
    Can you please explain why we don't want to do that? :) @tchrist – fab Aug 03 '15 at 20:30
  • @fab because it's unnecessary with utf8. With utf16 the bom indicates whether the file is big endian or little endian, there is a bom for one and a bom for the other and the bom is mandatory, but with utf8 there aren't two variant boms, the bom is unnecessary https://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom?rq=1 – barlop Dec 12 '17 at 22:48