
I want to be confident that using a BOM to mark a file's encoding is absolutely necessary, for the following reasons.

  • The information in a file should be self-contained, and we have not figured out a clear algorithm for identifying which encoding is appropriate for a file.
  • As for the compatibility issue with the shebang line, that should be fixed inside the scripting language, because encoding is a much higher-level concept than the shebang line.

Regarding the first claim, I have a hard time determining which encoding is right for a given file. As a result, files frequently end up being read with the wrong encoding, and I suspect most new developers run into this situation and simply ignore the garbled characters caused by the mismatched encodings.
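As a small illustration of why this is hard (a minimal sketch of my own; the byte string is just an example, not taken from any particular file): the same sequence of bytes often decodes without any error under several legacy single-byte encodings, so trial-and-error decoding alone cannot tell which encoding the author actually intended.

    # "café" encoded as UTF-8: the 'é' becomes the two bytes 0xC3 0xA9.
    data = "café".encode("utf-8")

    # Try a few common encodings; none of them raises an error,
    # yet only one of them reproduces the intended text.
    for enc in ("utf-8", "latin-1", "cp1252"):
        print(enc, "->", data.decode(enc))

    # utf-8   -> café
    # latin-1 -> cafÃ©
    # cp1252  -> cafÃ©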

I recognize that backward compatibility is an important aspect of software maintenance. However, I think an old rule that keeps confusing the system should be changed going forward.

Is there any thought or movement toward making the BOM official? Or is there a critical reason why we must not introduce the BOM (e.g. a clear algorithm for identifying a file's encoding already exists)?

My understanding comes from the following link, so any additional links that might change my perspective would be greatly appreciated.

Thanks,

  • FWIW, for quite some time now, by default, Windows Notepad (billions of end users) saves as UTF-8 without BOM (it used to be ANSI/code page). That doesn't mean the BOM isn't super useful, as outlined in the link. – Simon Mourier Apr 01 '21 at 07:01
  • A lot of software will not correctly process UTF-8 files with BOM. While it might help in a few cases and can be helpful in a closed environment, it will cause more problems than good in the real open world. – Codo Apr 01 '21 at 07:21
  • @Codo I understand your point that the BOM could give some benefit but causes numerous engineering issues, and I agree. Consider the case of the print statement in Python 2: countless applications written in Python 2 used print as a statement, but that form was removed in Python 3 and replaced with the print() function. I think this kind of transition is needed for the encoding rule as well. What do you think? BTW, thanks for the comments. – Seonghyeon Lee Apr 01 '21 at 08:10

1 Answer


Your first assumption is wrong. We have protocols to define what a file (or a packet) contains and how to interpret that content, and we should always keep metadata separate from data. You are effectively proposing the BOM as metadata describing the bytes that follow, but that is not enough. "Text data" by itself is not very useful information: we still need to understand and interpret what the text means. The most obvious example is interpreting U+0020 (space) either as a printed character or as control data. HTML takes the second view (two spaces are not special, nor is a space followed by a newline, except inside <pre>). We also have mail messages, mailboxes, Markdown files, HTML, and so on: the BOM alone doesn't help. So, following your first point, we would need to add more and more information, and at that point we would have a general container format (metadata plus one or more data streams). That is no longer plain text, and it is not the BOM that is helping us.

If you need a BOM, you have already lost the battle, because you can have something that looks like a BOM but is actually real data in another encoding. Two or three bytes are not enough. The shebang, which is old, used four bytes, #! / (the space is no longer required), but in any case it is an old protocol from a time when files were not exchanged heavily and the path was relevant (nobody executed random files, and if it wasn't a shebang, it was an a.out file).
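To make the shebang point concrete (a minimal sketch of my own; the script contents are made up): the exec machinery recognizes a script by its first two bytes being #!, so a UTF-8 BOM (EF BB BF) placed in front of the shebang hides it.

    # A script saved as UTF-8 with a BOM: the three BOM bytes come first.
    with_bom = b"\xef\xbb\xbf#!/bin/sh\necho hello\n"
    without_bom = b"#!/bin/sh\necho hello\n"

    # The loader only looks at the first two bytes to detect a script.
    for label, blob in (("with BOM", with_bom), ("without BOM", without_bom)):
        print(label, "-> recognized as script:", blob.startswith(b"#!"))

    # with BOM -> recognized as script: False
    # without BOM -> recognized as script: True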

And you are discussing old stuff. Now everything is UTF-8; there is no need for a BOM. Microsoft is just making things more complex: Unix, Linux, and macOS went through a short transition without much pain (and no "flag day"). The web, too, defaults to UTF-8. Your question is about programming languages, but there UTF-8 is fine: they use ASCII in their syntax, and what is inside strings doesn't matter very much. It is standard to treat strings and Unicode as opaque objects, except in a few cases; otherwise you will get something about Unicode wrong (e.g. splitting combining characters, or splitting emoji in languages that work with UTF-16 code units).
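A small illustration of those two pitfalls (a sketch of my own, using Python only to show the byte-level behaviour): an emoji outside the Basic Multilingual Plane takes two UTF-16 code units, so cutting at a code-unit boundary corrupts it, and a base letter plus a combining accent can likewise be split apart.

    # One emoji = one code point, but two UTF-16 code units (a surrogate pair).
    thumbs = "\U0001F44D"                      # 👍
    utf16 = thumbs.encode("utf-16-le")
    print(len(thumbs), len(utf16) // 2)        # 1 code point, 2 code units

    # Keeping only the first code unit leaves a lone surrogate: invalid text.
    print(utf16[:2].decode("utf-16-le", errors="replace"))   # replacement character

    # Combining characters: "é" written as 'e' + U+0301 breaks if sliced naively.
    word = "cafe\u0301"
    print(word[:4])                            # 'cafe' – the accent is cut off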

UTF-16 is not something you will write programs in. It may be used by APIs (a fixed-length unit may be, or may seem, better), or possibly for data, but usually not for source code.

And the BOM doesn't help unless you modify all scripts and programs (but in that case, let's just declare "everything is UTF-8"). It is not rare to find program sources in multiple encodings, even within the same file: you may have copy-pasted the copyright notice (and thus the author's name) with a simple editor in one encoding, the strings may be in another encoding, and a few comments (and committers' names) may be in yet another. And git (and other tools) just work line by line, so they may insert lines in the wrong encoding: git has very little information to go on, and users often have incorrect configurations. So you may break sources that were fine (just because the encoding problems were confined to comments).

Now a short comment on the second assumption, which is also problematic.

You want to split layers, but this is very problematic: we have scripts that contain binary data at the end, so the operating system should not try to transcode the script (and thereby remove the BOM), because the first part may be plain text while other parts may require exactly the right bytes. (Some Unicode test files also fall into this category: they are text, possibly containing deliberately invalid sequences.)

Just use UTF-8 without a BOM and everything becomes much simpler.
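If you still have to consume files produced by tools that prepend a BOM (such as the Excel/CSV case mentioned in the comments below), Python's built-in "utf-8-sig" codec is one pragmatic compromise: it strips a leading BOM when reading if one is present, while you keep writing plain UTF-8 without one. A minimal sketch (the file name is made up):

    import os, tempfile

    # Simulate a file written by a tool that prepends a BOM.
    path = os.path.join(tempfile.gettempdir(), "example.csv")   # hypothetical file
    with open(path, "wb") as f:
        f.write(b"\xef\xbb\xbfname,amount\ncafe,3\n")

    # 'utf-8' lets the BOM leak into the data as U+FEFF; 'utf-8-sig' strips it.
    with open(path, encoding="utf-8") as f:
        print(repr(f.readline()))        # '\ufeffname,amount\n'
    with open(path, encoding="utf-8-sig") as f:
        print(repr(f.readline()))        # 'name,amount\n'

    # When writing, plain UTF-8 without a BOM keeps things simple.
    with open(path, "w", encoding="utf-8") as f:
        f.write("name,amount\ncafe,3\n")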

Giacomo Catenazzi
  • First, thanks for the answer. Actually, I don't need any information about a file except its encoding; I really don't want to know secondary information such as the extension. BTW, would a metadata file need an additional custom encoding rule of its own? Also, as you mentioned, we have to assume that "all files are encoded as UTF-8". In practice this is not the case: we frequently encounter files encoded in some other Unicode encoding. You also say that UTF-16 will never become the default, but who knows? As for the second point, I completely understand it is hard, but I think it would be necessary if the BOM were accepted. – Seonghyeon Lee Apr 01 '21 at 08:42
  • Without a BOM, we have to decode the whole data stream under every Unicode encoding and pick the one that doesn't raise an error. Is there any better algorithm for determining the appropriate encoding? The reason for this question is to arrive at a confident view of file encoding. If my tone sounds rude, please forgive me. Thanks. – Seonghyeon Lee Apr 01 '21 at 08:48
  • Also, in the case of media, the header is contained in the file itself. What do you think about that? Should the metadata for a video file be separated from the file itself? – Seonghyeon Lee Apr 01 '21 at 08:52
  • I agree with the comment that Microsoft makes things complex. I looked up the difference between "utf-8" and "utf-8-sig" because a CSV file written with "utf-8" breaks in Microsoft Excel, whereas one written with "utf-8-sig" works. – Seonghyeon Lee Apr 01 '21 at 08:57
  • Encoding should be specified by the protocol or by headers. Video is a container format, but when we read a text file we want text, not all the data (with a video you may not watch every stream and every audio channel). The web has its own protocol: check whether there is a BOM and whether it looks OK, otherwise assume UTF-8 (and no other encoding, unless UTF-8 gives errors); a sketch of this strategy follows after these comments. I think that protocol is fine (not perfect, but valid UTF-8 is only a subset of all possible byte sequences). Otherwise treat the data as Latin-1 (if we go too far into the past, there are so many encodings that it is impossible to detect them automatically). – Giacomo Catenazzi Apr 01 '21 at 09:27
  • I really appreciate your kind answer. This discussion has been informative for me! Thanks :) – Seonghyeon Lee Apr 01 '21 at 11:43
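For reference, here is a rough sketch of the decoding strategy described in the comment above (an explicit BOM if present, otherwise try UTF-8, otherwise fall back to Latin-1). The function name and the exact set of BOMs handled are my own illustration, not an established API:

    import codecs

    def guess_decode(data: bytes) -> str:
        """Illustrative only: BOM if present, else UTF-8, else a Latin-1 fallback."""
        # 1. An explicit BOM wins (UTF-8 or UTF-16 in either byte order).
        for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                         (codecs.BOM_UTF16_LE, "utf-16"),
                         (codecs.BOM_UTF16_BE, "utf-16")):
            if data.startswith(bom):
                return data.decode(enc)
        # 2. Otherwise assume UTF-8; valid UTF-8 is only a small subset of byte strings.
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            # 3. Last resort: Latin-1 never fails, though it may not be what was meant.
            return data.decode("latin-1")

    print(guess_decode("café".encode("utf-8")))     # café
    print(guess_decode("café".encode("latin-1")))   # café (via the fallback)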