
I have strings similar to the ones below:

 1. the quick brown `[fox].[jumps]` [over] the lazy dog
 2. the quick brown fox [jumps] [over] the lazy dog
 3. `[the].[quick]` brown `[fox].[jumps]` [over] the lazy dog

I need to extract the values below:

 1. fox.jumps
 2. <Nothing>
 3. the.quick, fox.jumps

Could you please help me with the regular expressions for this in a shell script?

  • I did try the regular expressions such as sed 's/\(\[.*\]\)\.\(\[.*\]\)/\1\2/g' . But this doesn't seem to be working for me – Waseem Ahmed Feb 09 '21 at 11:49
    @WaseemAhmed `[.*]` is a bracket expression which matches either of the literal characters `.` or `*`. – Ed Morton Feb 09 '21 at 12:14
  • can i assume the \`[fox].[jumps]\` backtick quotation marks are always there in the data? – RARE Kpop Manifesto Feb 11 '21 at 21:36
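
To make the point about bracket expressions concrete, here is a minimal sketch, assuming a grep that supports `-o` (GNU and BSD grep do) and input without the backticks. The helper name `extract_pairs` is hypothetical and not part of the question.

```shell
# [.*] is a bracket expression: exactly one character, either "." or "*".
printf '%s\n' 'a.b' 'a*b' 'axb' | grep -c 'a[.*]b'    # counts the matching lines

# A literal "[" must be escaped, and [^]]* keeps each match inside one [...] pair
# instead of running greedily to the last "]" on the line.
# extract_pairs is a hypothetical helper name used only for this illustration.
extract_pairs() {
  grep -o '\[[^]]*]\.\[[^]]*]' | tr -d '[]'
}

printf '%s\n' '[the].[quick] brown [fox].[jumps] [over] the lazy dog' | extract_pairs
```

Each match is printed on its own line, so joining them with `, ` would still need a further step.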

5 Answers


With GNU awk for multi-char RS, RT, and gensub():

$ awk -v RS='[[][^]]*][.][[][^]]*]' 'RT{print gensub(/[][]/,"","g",RT)}' file
Ed Morton
  • Thanks for the code it worked for me. Would you please explain the pattern in record separator and gensub? – Waseem Ahmed Feb 09 '21 at 20:26
  • They aren't patterns, they're regular expressions. The RS is 2 instances of `[[]` (a `[`) then `[^]]*` (zero or more non-`]` characters) then `]`, with `[.]` (a `.`) between them. In the gensub, `[][]` is a bracket expression that'll match all `[`s or `]`s. – Ed Morton Feb 09 '21 at 22:29
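
Multi-char RS, RT, and gensub() are GNU awk extensions. As a rough sketch for other awks, the same regular expression can be applied with the POSIX `match()` function and its RSTART/RLENGTH variables. The function name `extract_joined` is hypothetical, and joining with `, ` and printing `<Nothing>` are assumptions taken from the expected output in the question.

```shell
# extract_joined is a hypothetical helper name; it reads the named files or stdin.
extract_joined() {
  awk '{
    out = ""
    rest = $0
    # [[] is a literal "[", [^]]* is any run of non-"]" characters, [.] is a literal "."
    while (match(rest, /[[][^]]*][.][[][^]]*]/)) {
      tok = substr(rest, RSTART, RLENGTH)
      gsub(/[][]/, "", tok)                      # drop every "[" and "]"
      out = out (out == "" ? "" : ", ") tok
      rest = substr(rest, RSTART + RLENGTH)      # continue after this match
    }
    print (out == "" ? "<Nothing>" : out)
  }' "$@"
}

printf '%s\n' \
  'the quick brown [fox].[jumps] [over] the lazy dog' \
  'the quick brown fox [jumps] [over] the lazy dog' \
  '[the].[quick] brown [fox].[jumps] [over] the lazy dog' | extract_joined
```

Unlike the RS/RT version, this prints one output line per input line rather than one per match.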

With your shown samples, could you please try following. Written and tested in GNU awk.

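awk '
{ val=""   # reset per line; the loop below is an assumed reconstruction of lines missing from the post
  for(i=1;i<=NF;i++)
    if($i~/^`?[[][^]]*][.][[][^]]*]`?$/ && gsub(/[][`]/,"",$i))
      val=(val?val ", ":"")$i
  print (val==""?"<Nothing>":val)
}'  Input_file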

Output for the shown samples will be as follows.

fox.jumps
<Nothing>
the.quick, fox.jumps
Ed Morton
This is another one-liner, at least from the shell's point of view (you can remove each \ and the newline after it to put everything on one line).

Make sure \ is always the last character on the line, with no space after it.

gawk '{\
            if(c!="[" && c!="]")\
         printf(" ");\
}' example.txt

Anyway, it would be useful to put the gawk code between the two ' characters into a file (test.awk, with all the \ removed) and then call gawk as shown below, so that test.awk starts with { and ends with }. It might not be an elegant solution, but you can add a lot more to it this way, such as many variables, a whole program, subroutines, ...

gawk -f test.awk example.txt

the.quick fox.jumps

    You don't need those backslashes at the end of every line, you can just remove them. Ditto for the semi-colons at the end of lines unless you plan to pack it all into one line. Since you aren't using RSTART or RLENGTH there's no point calling `match($i,/re/)` instead of just doing `$i ~ /re/`. Looping through every character in `$i` at a time and not printing if it's `[` or `]` is very inefficient compared to `gsub(/[][]/,"",$i)`. `printf "\n"` is usually equivalent to `print ""` but the latter is more portable and so better. – Ed Morton Feb 09 '21 at 12:47
  • Ed Morton: You are right. My solution is not efficient. It depends on size of data, if that is relevant. In case, it really is, then it might even be useful to use C. – sidcoder Feb 09 '21 at 13:11
    C isn't necessarily faster than awk for text processing (see https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/ and note that gawk is much faster now than it was back then). – Ed Morton Feb 09 '21 at 13:19
  • and mawk2 beta (mawk is anywhere from 2.0x to 3.5x faster than mawk 1.3.4 - and that's running merely as a single-threaded application on my mac, utilizing only 1 of the 8 cores. can't imagine the insanity of the beast if someone could make it leverage multi-cores. yes multi-byte octals are VERY hideous relative to UTF8 code points in hex, and that's the necessary evil required to attain performance on certain tasks faster than compiled C binaries. – RARE Kpop Manifesto Feb 12 '21 at 00:01
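
The suggestions in the first comment above (no trailing backslashes, `$i ~ /re/` instead of `match()`, one `gsub()` instead of a per-character loop, `print ""` instead of `printf "\n"`) might look roughly like the sketch below. The function name `strip_fields` is hypothetical, and the regular expression is an assumption about what the answer matched, applied to input without backticks.

```shell
# strip_fields is a hypothetical helper name; it reads the named files or stdin.
strip_fields() {
  awk '{
    sep = ""
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[[][^]]*][.][[][^]]*]$/) {   # plain match test, no match() call needed
        gsub(/[][]/, "", $i)                   # one call instead of a per-character loop
        printf "%s%s", sep, $i
        sep = " "
      }
    print ""                                   # ends the output line
  }' "$@"
}

printf '%s\n' '[the].[quick] brown [fox].[jumps] [over] the lazy dog' | strip_fields
```

Lines without a matching field produce an empty output line.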

With sed (that supports \n in s/ commands).

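sed '
    s/$/\n/
    : again
    /\([^\n]*\)\[\([^]]*\)\]\.\[\([^]]*\)\]\([^\n]*\)\n/{
        s//\1\4\n\2.\3\n/
        b again
    }
    s/[^\n]*\n//; s/\n$//; s/\n/, /g'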
  • s/$/\n/ - add a newline at the end of the line read. Important for lines that contain no match.
  • : again - define a label named again that you can branch to
  • /.../ - match a regex
    • \([^\n]*\) - match any run of non-newline characters and remember it in \1
    • \[\([^]]*\)\]\.\[\([^]]*\)\] - match [something].[something] and remember the parts in \2 and \3
    • \([^\n]*\)\n - match any run of non-newline characters, remember it in \4, then match the newline
  • /.../{ - when the regex is matched
    • s// - reuse the last regex, i.e. the one above
    • /\1\4\n\2.\3\n/ - shuffle the input so that the non-interesting parts are placed before the newline and the extracted part after it
    • b again - go to again, to match another pattern
  • s/[^\n]*\n// - remove the non-matched part of the line
  • s/\n$// - remove the trailing newline
  • s/\n/, /g - separate the parts with a comma and a space
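
To watch the shuffling described in the steps above, one option is to add sed's `l` command inside the loop; it prints the pattern space with each newline shown as `\n` and a `$` marking the end. This is only a sketch: the function name `trace_sed` is hypothetical, and like the answer itself it assumes a sed that understands `\n` (for example GNU sed).

```shell
# trace_sed is a hypothetical helper name; it needs a sed that understands \n (e.g. GNU sed).
trace_sed() {
  sed '
    s/$/\n/
    : again
    /\([^\n]*\)\[\([^]]*\)\]\.\[\([^]]*\)\]\([^\n]*\)\n/{
        s//\1\4\n\2.\3\n/
        l
        b again
    }
    s/[^\n]*\n//
    s/\n$//
    s/\n/, /g
  '
}

printf '%s\n' '[the].[quick] brown [fox].[jumps] [over] the lazy dog' | trace_sed
```

Each pass through the loop prints one trace line, followed by the final result for that input line.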


$ sed 's/$/\n/; : again; /\([^\n]*\)\[\([^]]*\)\]\.\[\([^]]*\)\]\([^\n]*\)\n/{ s//\1\4\n\2.\3\n/; b again; }; s/[^\n]*\n//; s/\n$//; s/\n/, /g' <<EOF
the quick brown [fox].[jumps] [over] the lazy dog
the quick brown fox [jumps] [over] the lazy dog
[the].[quick] brown [fox].[jumps] [over] the lazy dog
EOF
fox.jumps

the.quick, fox.jumps

If you do not want the empty line in between, do not output anything from sed when no pattern was found. Add -n to sed and, at the end of the script, print only if the pattern space is not empty with /^$/!p, like so:

sed -n 's/$/\n/; : again; /\([^\n]*\)\[\([^]]*\)\]\.\[\([^]]*\)\]\([^\n]*\)\n/{ s//\1\4\n\2.\3\n/; b again; }; s/[^\n]*\n//; s/\n$//; s/\n/, /g; /^$/!p'
With GNU awk for FPAT (using regexp and gensub function from Ed Morton's code):

awk -v OFS=', ' -v FPAT='[[][^]]*][.][[][^]]*]' '{for (i=1; i<=NF; i++) printf "%s%s", gensub(/[][]/,"","g",$i), (i<NF?OFS:ORS)}' file
fox.jumps
the.quick, fox.jumps
Carlos Pascual
