Your first assumption is wrong. We have protocols that define what a file (or a packet) contains and how to interpret that content, and we should always keep metadata separate from data. You are effectively proposing the BOM as a piece of metadata describing the bytes that follow, but that is not enough. "This is text" is not very useful information on its own: we still need to interpret what the text means. The most obvious example is whether U+0020 (SPACE) is a printed character or control data: HTML treats it as the latter (runs of whitespace are nothing special, nor is a space followed by a newline, except inside `<pre>`). And beyond that: we have mail messages, mailboxes, Markdown files, HTML, etc. A BOM alone doesn't help. So, for your first point, we would need to add more information, and more again, until we end up with a general container format (metadata plus one or more blocks of data); at that point it is no longer plain text, and it is not the BOM that helped us.
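To make the whitespace point concrete, here is a minimal Python sketch of what an HTML renderer does outside `<pre>` (a simplified model, not the real parsing rules: actual HTML whitespace processing has more cases):

```python
import re

# Outside <pre>, HTML renderers collapse any run of whitespace
# (spaces, tabs, newlines) into a single space when displaying text.
html_text = "Hello   world\nand   more"
rendered = re.sub(r"\s+", " ", html_text)

print(rendered)  # Hello world and more
```

The same U+0020 bytes carry completely different meaning depending on the format interpreting them, which is exactly the information a BOM cannot provide.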
If you need a BOM, you have already lost the battle: bytes that look like a BOM may be real data in another encoding. Two or three bytes are not enough of a signature. The shebang, which is old, used four bytes, `#! /` (the space is no longer required); but it is an old protocol from a time when files were rarely exchanged and the leading path was meaningful (nobody executed random files, and if a file did not start with a shebang, it was an a.out binary).
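A quick Python check of that claim: the three bytes of the UTF-8 BOM are perfectly legal text in a legacy encoding, so seeing them proves nothing about the rest of the file:

```python
BOM_UTF8 = b"\xef\xbb\xbf"

# In UTF-8 these bytes decode to a single ZERO WIDTH NO-BREAK SPACE (the BOM)...
print(repr(BOM_UTF8.decode("utf-8")))    # '\ufeff'

# ...but in windows-1252 they are three ordinary printable characters.
print(BOM_UTF8.decode("cp1252"))         # ï»¿
```

So a file starting with these bytes might be UTF-8 with a BOM, or a windows-1252 file that genuinely begins with `ï»¿`; the bytes alone cannot tell you which.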
And you are discussing old stuff. Now everything is UTF-8; there is no need for a BOM. Microsoft is just making things more complex: Unix, Linux and macOS made a short transition without much pain (and no "flag day"). The web, too, is UTF-8 by default. Your question is about programming languages, but there UTF-8 is fine: their syntax uses ASCII, and what is inside strings doesn't matter much. It is standard to treat strings and Unicode as opaque objects except in a few cases; otherwise you will get something wrong in Unicode (e.g. splitting combining characters, or splitting an emoji in languages that work with UTF-16 code units, etc.).
UTF-16 is not something you will write programs in. It may be used by APIs (fixed-length units may seem better), or possibly for data, but usually not for source code.
And a BOM doesn't help unless you modify all scripts and programs (but if you are going to do that, let's just declare "everything is UTF-8" instead). It is not rare to find program sources in multiple encodings, even within the same file: you may have copy-pasted the copyright notice (with the author's name) using a simple editor in one encoding, while the strings are in another encoding and a few comments (with committers' names) in yet another. And git (like other tools) just tracks lines, so it may insert lines in the wrong encoding: git has very little information about encodings, and users often have incorrect configurations. So you may break sources that were working fine, just because the encoding problems were confined to comments.
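Detecting such mixed-encoding files is easy to sketch. Here is a small, hypothetical Python checker (not a standard tool) that flags lines which are not valid UTF-8:

```python
def find_non_utf8_lines(raw: bytes):
    """Yield (line_number, line_bytes) for each line that fails UTF-8 decoding."""
    for n, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            yield n, line

# A source file where a latin-1 comment slipped in among valid UTF-8 lines
# (0xF6 is 'ö' in latin-1 but an invalid lone continuation byte in UTF-8):
source = b'# Copyright J\xf6rg\nprint("caf\xc3\xa9")\n'
bad = list(find_non_utf8_lines(source))
print(bad)  # [(1, b'# Copyright J\xf6rg')]
```

Note that a BOM at the top of this file would claim "this is UTF-8" while line 1 still isn't, which is exactly why the BOM solves nothing here.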
Then a short comment on your second assumption, which is also problematic.
You want to split layers, but this is very problematic: we have scripts that contain binary data at the end, so the operating system must not try to transcode the script (nor remove a BOM), because the first part may be plain text while other parts require exactly the original bytes. (Some Unicode test files are also in this category: they are text, possibly containing deliberately invalid sequences.)
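This is the classic self-extracting pattern: a file whose first lines are text and whose tail is raw bytes. A minimal Python sketch (the marker name is made up for illustration) showing why the file as a whole must never be transcoded:

```python
MARKER = b"__PAYLOAD_BELOW__\n"

header = b"#!/bin/sh\n# installer script (text part)\n" + MARKER
payload = bytes([0x00, 0xFF, 0xFE, 0x89])  # arbitrary binary, not valid UTF-8
blob = header + payload

# Correct handling: split on the marker, decode only the text part.
text_part, _, binary_part = blob.partition(MARKER)
print(text_part.decode("utf-8"))
assert binary_part == payload

# Treating the whole file as text fails (or silently mangles the bytes):
try:
    blob.decode("utf-8")
except UnicodeDecodeError:
    print("payload is not valid UTF-8; transcoding would corrupt it")
```

Any layer that "helpfully" adds, strips, or converts a BOM on such a file destroys the payload, even though the file genuinely begins as text.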
Just use UTF-8 without a BOM, and everything becomes much simpler.