I am trying to use awk to remove all byte order marks from a file (I have many of them):

awk '{sub(/\xEF\xBB\xBF/,"")}{print}' f1.txt > f2.txt

It seems to remove the BOMs at the beginning of each line, but those in the middle of a line are not removed. I can verify that with:

grep -U $'\xEF\xBB\xBF' f2.txt

grep returns one line in which a BOM appears in the middle.

Roman
  • The only thing that comes to mind is that there is more than one BOM in a record and `sub` only removes the first one. Easy to verify with `gsub`. – James Brown Nov 20 '17 at 09:06
  • gsub really does resolve the problem. However, I still do not understand why. Doesn't sub replace every occurrence of a specified substring with another specified substring? – Roman Nov 20 '17 at 09:17
  • No, it only replaces the first occurrence in each record. – James Brown Nov 20 '17 at 09:26

1 Answer


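As mentioned in the comments, sub() replaces only the leftmost match in each record, so if a global replacement is what you're after, use gsub() or, even better if you have GNU awk, gensub() (a gawk extension that returns the modified string instead of changing the target in place).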

sub(regexp, replacement [, target])

Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. Modify the entire string by replacing the matched text with replacement. The modified string becomes the new value of target. Return the number of substitutions made (zero or one).

gsub(regexp, replacement [, target])

Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The ‘g’ in gsub() stands for “global,” which means replace everywhere.
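
To make the difference concrete, here is a minimal sketch that runs both functions on a line with one BOM at the start and one in the middle. The sample text, the variable names, and the choice to pass the BOM bytes in with -v (rather than the \xEF\xBB\xBF regex from the question) are assumptions made for illustration; LC_ALL=C is set so that awk matches byte by byte.

```shell
# Build the three BOM bytes (EF BB BF) with portable octal escapes.
bom=$(printf '\357\273\277')

# Sample record: one BOM at the start and one in the middle.
line="${bom}foo${bom}bar"

# sub() replaces only the leftmost match in each record,
# so the BOM between "foo" and "bar" survives.
after_sub=$(printf '%s\n' "$line" | LC_ALL=C awk -v bom="$bom" '{ sub(bom, "") } 1')

# gsub() replaces every non-overlapping match in each record.
after_gsub=$(printf '%s\n' "$line" | LC_ALL=C awk -v bom="$bom" '{ gsub(bom, "") } 1')

# Compare the results: only the gsub() version is free of BOMs.
[ "$after_sub" = "foobar" ]  && echo "sub: clean"  || echo "sub: BOM left"
[ "$after_gsub" = "foobar" ] && echo "gsub: clean" || echo "gsub: BOM left"
```

Applied to the command in the question, the same change would be replacing sub with gsub in the awk program.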

gensub(regexp, replacement, how [, target])

Search the target string target for matches of the regular expression regexp. If how is a string beginning with ‘g’ or ‘G’ (short for “global”), then replace all matches of regexp with replacement. Otherwise, "how" is treated as a number indicating which match of regexp to replace. gensub() is a general substitution function. Its purpose is to provide more features than the standard sub() and gsub() functions.
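
As a rough illustration of the how argument, the sketch below calls gensub() once with "g" and once with a number. It assumes gawk is installed (gensub() is not available in other awk implementations); the sample text and variable names are made up for the example. Unlike sub() and gsub(), gensub() returns the new string and leaves the target unchanged, so the result is printed directly.

```shell
# Build the three BOM bytes (EF BB BF) with portable octal escapes.
bom=$(printf '\357\273\277')

# Sample record with three BOMs: at the start and twice in the middle.
line="${bom}foo${bom}bar${bom}baz"

# how = "g": replace every match; the result is returned, $0 is untouched.
all_removed=$(printf '%s\n' "$line" | LC_ALL=C gawk -v bom="$bom" '{ print gensub(bom, "", "g") }')

# how = 2: replace only the second match (the BOM between "foo" and "bar").
second_removed=$(printf '%s\n' "$line" | LC_ALL=C gawk -v bom="$bom" '{ print gensub(bom, "", 2) }')

# Compare against the expected strings.
[ "$all_removed" = "foobarbaz" ] && echo "g: all BOMs removed"
[ "$second_removed" = "${bom}foobar${bom}baz" ] && echo "2: only the second BOM removed"
```

For the task in the question, gsub() is sufficient; gensub() is mainly useful when a specific occurrence has to be targeted or when the original record must stay intact.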

There is a lot more helpful information, with examples, in the section referenced below:

The GNU Awk User's Guide: String Functions / 9.1.3 String-Manipulation Functions

l'L'l