My content contains multiple BOM (EF BB BF) characters. They appear in the middle of strings, and I simply want to remove them all.

The data comes from a JavaScript source: I read it from a CKEditor instance, POST it, and read it as a string on my backend, where the BOMs show up. For now they are persisted as-is, but this causes errors in post-processing when the characters are interpreted and start showing up mid-content. I suspect they come from something that was copy-pasted into CKEditor.

I can step through the string char by char, but I don't know how to compare against the BOM. Is it somehow possible to look at the hex values of the string's bytes and match three-byte sequences?

Joel Peltonen

2 Answers

The UTF-8 BOM bytes (EF BB BF) get translated to \ufeff, the Unicode character "zero width no-break space". Can't see them, can't hear them. Filter them out with:

   var good = bad.Replace("\ufeff", "");
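To see why one character is enough, here is a minimal sketch, assuming the POSTed bytes are decoded as UTF-8 on the backend. The class and method names (`BomDemo`, `StripBom`) and the sample bytes are made up for illustration; only `String.Replace` and `Encoding.UTF8.GetString` are real framework calls.

```csharp
using System;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: removes every U+FEFF, wherever it sits in the text.
    public static string StripBom(string text)
    {
        // Replace(string, string) compares ordinally, so the invisible
        // character is matched exactly.
        return text.Replace("\uFEFF", "");
    }

    static void Main()
    {
        // "ab" + EF BB BF + "cd" as raw UTF-8 bytes (made-up sample data).
        byte[] raw = { 0x61, 0x62, 0xEF, 0xBB, 0xBF, 0x63, 0x64 };

        // The three BOM bytes decode to the single character U+FEFF.
        string bad = Encoding.UTF8.GetString(raw);
        Console.WriteLine(bad.Length);           // 5
        Console.WriteLine(bad[2] == '\uFEFF');   // True

        string good = StripBom(bad);
        Console.WriteLine(good);                 // abcd
    }
}
```

So once the data is a `string`, there is no three-byte sequence left to compare against; you only look for one `char`.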
Hans Passant
  • Great success! One question though, might this cause problems by removing other bytes that get translated into the same Unicode character? I doubt that I'll miss any if they get removed, but are there other important or worth-mentioning such characters? – Joel Peltonen Oct 23 '12 at 09:58
  • You can't see them, you can't hear them. – Hans Passant Oct 23 '12 at 10:15
Try the following:

CleanString = DirtyString.Replace("\u00EF\u00BB\u00BF", null);
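Note that this pattern is three separate characters (U+00EF, U+00BB, U+00BF), so it presumably only matches when the bytes were decoded with a single-byte encoding such as ISO-8859-1. A rough sketch of the difference, with made-up sample bytes and hypothetical names (`MisdecodedBomDemo`, `Strip`):

```csharp
using System;
using System.Text;

static class MisdecodedBomDemo
{
    // The three BOM bytes as they look when each byte becomes its own character.
    const string Pattern = "\u00EF\u00BB\u00BF";

    // Hypothetical helper: a null replacement removes every occurrence.
    public static string Strip(string text)
    {
        return text.Replace(Pattern, null);
    }

    static void Main()
    {
        // "a" + EF BB BF + "b" as raw bytes (made-up sample data).
        byte[] raw = { 0x61, 0xEF, 0xBB, 0xBF, 0x62 };

        // ISO-8859-1 maps every byte to one character, so the pattern matches.
        string latin1 = Encoding.GetEncoding("ISO-8859-1").GetString(raw);
        Console.WriteLine(Strip(latin1));        // ab

        // UTF-8 turns the three bytes into one U+FEFF, so nothing matches.
        string utf8 = Encoding.UTF8.GetString(raw);
        Console.WriteLine(Strip(utf8).Length);   // 3: U+FEFF is still there
    }
}
```

If the string was decoded as UTF-8, searching for `"\uFEFF"` is the variant that applies.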
Peter Stock
  • The way I tested this was to do `string s2 = s.Replace(...)` and then `Debug.WriteLine(s2);`. Then I copy-pasted the output from my output window to Notepad++ and switched to view HEX: I still see the BOM. Did I try it wrong? – Joel Peltonen Oct 23 '12 at 07:26
  • That's how it works for me. Maybe you will find [this](http://stackoverflow.com/questions/2502990/create-text-file-without-bom?rq=1) helpful. – Peter Stock Oct 23 '12 at 09:58