Extract strings between 2 texts enclosed between squared brackets

Question

I have strings similar to the ones below:

 1. the quick brown `[fox].[jumps]` [over] the lazy dog
 2. the quick brown fox [jumps] [over] the lazy dog
 3. `[the].[quick]` brown `[fox].[jumps]` [over] the lazy dog

I need to extract the following values:

 1. fox.jumps
 2. <Nothing>
 3. the.quick, fox.jumps

Could you please help me with a regular expression for this in a shell script?

Welcome to SO, please do add your efforts in form of code in your question, which is highly encouraged on SO, thank you. — RavinderSingh13, Feb 09 '21 at 11:20
I did try regular expressions such as sed 's/\(\[.*\]\)\.\(\[.*\]\)/\1\2/g', but this doesn't seem to be working for me. — Waseem Ahmed, Feb 09 '21 at 11:49
Thanks Waseem for letting us know, please do add them in your question that's always recommended, happy learning. — RavinderSingh13, Feb 09 '21 at 11:53
@WaseemAhmed : Please post the exact code you tried to solve the problem. — user1934428, Feb 09 '21 at 12:13
@WaseemAhmed `[.*]` is a bracket expression which matches either of the literal characters `.` or `*`. — Ed Morton, Feb 09 '21 at 12:14
@WaseemAhmed : In your comment, you say that you need a solution for `sed`, but you also tagged your question as `awk`. Please specify in your question, for what tool you are looking for a regexp. — user1934428, Feb 09 '21 at 12:19
@user1934428 the OP doesn't say anywhere that they need a solution for sed, they just shared a sed command they had tried. — Ed Morton, Feb 09 '21 at 12:26
Right, but if he does not want something for _sed_, why would he have tried it? At the least, since regexp syntax is not identical in all tools, the question should specify which tools he is comfortable using. — user1934428, Feb 09 '21 at 12:32
They tagged the question with sed and awk so it seems like they'd be comfortable with either of those and it's reasonable/expected that they'd have tried at least one of them. — Ed Morton, Feb 09 '21 at 12:39
can i assume the \`[fox].[jumps]\` backtick quotation marks are always there in the data? — RARE Kpop Manifesto, Feb 11 '21 at 21:36
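The difference between a bracket expression and escaped brackets, raised in the comments above, can be sketched with plain sed. This is only an illustration on made-up input, not the OP's exact command; the variable name `line` is arbitrary. It also shows why a greedy `.*` between escaped brackets grabs too much, and how `[^]]*` avoids that.

```shell
line='the quick brown [fox].[jumps] [over] the lazy dog'

# [.*] is a bracket expression: it matches one literal "." or "*".
printf '%s\n' 'a.b*c' | sed 's/[.*]/#/g'
# prints: a#b#c

# \[.*\] matches a literal "[", then as much as possible, then a literal "]",
# so it runs from the first "[" to the last "]" on the line.
printf '%s\n' "$line" | sed 's/\[.*\]/<&>/'
# prints: the quick brown <[fox].[jumps] [over]> the lazy dog

# \[[^]]*\] stops at the first "]" after the opening "[".
printf '%s\n' "$line" | sed 's/\[[^]]*\]/<&>/'
# prints: the quick brown <[fox]>.[jumps] [over] the lazy dog
```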

score 1 · Accepted Answer · answered Feb 09 '21 at 11:29


With GNU awk for multi-char RS, RT, and gensub():

$ awk -v RS='[[][^]]*][.][[][^]]*]' 'RT{print gensub(/[][]/,"","g",RT)}' file
fox.jumps
the.quick
fox.jumps

answered Feb 09 '21 at 11:29 by Ed Morton

Thanks for the code it worked for me. Would you please explain the pattern in record separator and gensub? – Waseem Ahmed Feb 09 '21 at 20:26
They aren't patterns, they're regular expressions. The RS is 2 instances of `[[]` (a `[`) then `[^]]*` (zero or more non-`]` characters) then `]`, with `[.]` (a `.`) between them. In the gensub, `[][]` is a bracket expression that'll match any `[` or `]`. – Ed Morton Feb 09 '21 at 22:29
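For awks without RT or gensub(), the same regular expression can be applied with a match() loop. The following is a rough sketch under that assumption, not part of the answer above; the function name `extract_pairs` and the variables `rest`, `pair` and `out` are made up for illustration. It prints one comma-separated line per input line, with `<Nothing>` when no pair is found, as the question asks.

```shell
extract_pairs() {
  awk '{
    out = ""
    rest = $0
    # Repeatedly find the next [a].[b] pair in what is left of the line.
    while (match(rest, /[[][^]]*][.][[][^]]*]/)) {
      pair = substr(rest, RSTART, RLENGTH)
      gsub(/[][]/, "", pair)               # strip the square brackets
      out = (out == "" ? "" : out ", ") pair
      rest = substr(rest, RSTART + RLENGTH)
    }
    print (out == "" ? "<Nothing>" : out)
  }' "$@"
}

extract_pairs <<'EOF'
the quick brown [fox].[jumps] [over] the lazy dog
the quick brown fox [jumps] [over] the lazy dog
[the].[quick] brown [fox].[jumps] [over] the lazy dog
EOF
```

Because the regexp is searched for anywhere in the line rather than compared against whole fields, this should also cope with pairs that are wrapped in backticks or other characters.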

score 0 · Answer 2 · edited Feb 09 '21 at 12:37


With your shown samples, please try the following. Written and tested in GNU awk.

awk '
{
  val=""
  for(i=1;i<=NF;i++){
    if($i~/^\[.*]\.\[.*]$/){
      gsub(/[][]/,"",$i)
      val=(val?val ", ":"")$i
    }
  }
  print (val==""?"<Nothing>":val)
}'  Input_file

For the shown samples, the output is as follows.

fox.jumps
<Nothing>
the.quick, fox.jumps

answered Feb 09 '21 at 12:26 by RavinderSingh13, edited Feb 09 '21 at 12:37 by Ed Morton

score 0 · Answer 3 · edited Feb 09 '21 at 12:54

This is another one-liner, at least from the shell's point of view (you can remove each \ and newline to put it on one line).

Make sure \ is always the last character of the line, with no space after it.

gawk '{\
  for(i=1;i<=NF;i++)\
  {\
     if(match($i,/\]\.\[/)>0)\
     {\
         for(k=1;k<=length($i);k++)\
         {\
            c=substr($i,k,1);\
            if(c!="[" && c!="]")\
            printf("%s",c);\
         }\
         printf(" ");\
     }\
  }\
  printf("\n");\
}' example.txt

Anyway, it would be useful to put the gawk code between ' and ' into a file (test.awk, with all the \ removed) and then call gawk as shown below, so that test.awk starts with { and ends with }. It might not be an elegant solution, but you can add a lot more to it this way: more variables, a whole program, subroutines, ...

gawk -f test.awk example.txt

Output:
fox.jumps 

the.quick fox.jumps

answered Feb 09 '21 at 12:42 by sidcoder, edited Feb 09 '21 at 12:54

You don't need those backslashes at the end of every line, you can just remove them. Ditto for the semi-colons at the end of lines unless you plan to pack it all into one line. Since you aren't using RSTART or RLENGTH there's no point calling `match($i,/re/)` instead of just doing `$i ~ /re/`. Looping through every character in `$i` at a time and not printing if it's `[` or `]` is very inefficient compared to `gsub(/[][]/,"",$i)`. `printf "\n"` is usually equivalent to `print ""` but the latter is more portable and so better. – Ed Morton Feb 09 '21 at 12:47

`k – Ed Morton Feb 09 '21 at 12:49
Ed Morton: You are right. My solution is not efficient. Whether that matters depends on the size of the data. If it really does, it might even be worth using C. – sidcoder Feb 09 '21 at 13:11

C isn't necessarily faster than awk for text processing (see https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/ and note that gawk is much faster now than it was back then). – Ed Morton Feb 09 '21 at 13:19
and mawk2 beta (mawk 1.9.9.6) is anywhere from 2.0x to 3.5x faster than mawk 1.3.4 - and that's running merely as a single-threaded application on my mac, utilizing only 1 of the 8 cores. can't imagine the insanity of the beast if someone could make it leverage multi-cores. yes multi-byte octals are VERY hideous relative to UTF8 code points in hex, and that's the necessary evil required to attain performance on certain tasks faster than compiled C binaries. – RARE Kpop Manifesto Feb 12 '21 at 00:01
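The suggestions in the first comment on this answer (no trailing backslashes or semicolons, `$i ~ /re/` instead of match(), gsub() instead of a per-character loop, and `print` instead of `printf "\n"`) might be applied roughly as follows. This is a sketch rather than the answerer's code; the function name `strip_pairs` and the variable `out` are made up, and it keeps the same space-separated output, including the trailing space.

```shell
strip_pairs() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /\]\.\[/) {           # field contains "].["
        gsub(/[][]/, "", $i)         # delete every "[" and "]"
        out = out $i " "
      }
    }
    print out
  }' "$@"
}

strip_pairs <<'EOF'
the quick brown [fox].[jumps] [over] the lazy dog
the quick brown fox [jumps] [over] the lazy dog
[the].[quick] brown [fox].[jumps] [over] the lazy dog
EOF
```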

score 0 · Answer 4 · answered Feb 09 '21 at 12:53

With sed (that supports \n in s/ commands).

sed '
    s/$/\n/
    : again
    /\([^\n]*\)\[\([^]]*\)\]\.\[\([^]]*\)\]\([^\n]*\)\n/{
        s//\1\4\n\2.\3\n/
        b again
    }
    s/[^\n]*\n//
    s/\n$//
    s/\n/, /g
'

s/$/\n/ - add a newline to the end of the line that was read. This matters for lines without any match, because the later s/[^\n]*\n// needs a newline to delete up to.
: again - define the label again that you can branch to
/.../ - match a regex:
- \([^\n]*\) - match any run of non-newline characters and remember it in \1
- \[\([^]]*\)\]\.\[\([^]]*\)\] - match [something].[something] and remember the two parts in \2 and \3
- \([^\n]*\)\n - match any run of non-newline characters, remember it in \4, then match the newline
/.../{ - when the regex matches:
- s// - reuse the last regex, i.e. the one above
- /\1\4\n\2.\3\n/ - shuffle the input so that the uninteresting parts end up before the newline and the extracted part after it
- b again - go back to again to look for another match
s/[^\n]*\n// - remove the non-matched part of the line
s/\n$// - remove the trailing newline
s/\n/, /g - separate the extracted parts with a comma and a space
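To see what the loop described above leaves in the pattern space before the cleanup steps run, the script can be cut short and the embedded newlines made visible. This is only a tracing sketch, assuming GNU sed (for `\n` inside bracket expressions and in the replacement); the function name `trace_pairs` and the `|` marker are made up for illustration.

```shell
trace_pairs() {
  sed '
    # Append a newline that separates leftovers from extracted pairs.
    s/$/\n/
    : again
    /\([^\n]*\)\[\([^]]*\)\]\.\[\([^]]*\)\]\([^\n]*\)\n/{
        # Move one [a].[b] pair behind the first newline, then retry.
        s//\1\4\n\2.\3\n/
        b again
    }
    # Stop before the cleanup and show each embedded newline as "|".
    s/\n/|/g
  '
}

printf '%s\n' '[the].[quick] brown [fox].[jumps] [over] the lazy dog' | trace_pairs
```

The printed line should start with the leftover text and end with `|the.quick|fox.jumps|`: everything before the first `|` is what s/[^\n]*\n// later deletes, and the remaining markers are where the trailing newline is removed and the `, ` separators are inserted.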

Example:

$ sed 's/$/\n/; : again; /\([^\n]*\)\[\([^]]*\)\]\.\[\([^]]*\)\]\([^\n]*\)\n/{ s//\1\4\n\2.\3\n/; b again; }; s/[^\n]*\n//; s/\n$//; s/\n/, /g' <<EOF
the quick brown [fox].[jumps] [over] the lazy dog
the quick brown fox [jumps] [over] the lazy dog
[the].[quick] brown [fox].[jumps] [over] the lazy dog
EOF

outputs:

fox.jumps

the.quick, fox.jumps

If you do not want the empty line in between, do not output anything from sed when no matches were found. Run sed with -n and, at the end of the script, print only if the pattern space is not empty with /^$/!p, like so:

sed -n 's/$/\n/; : again; /\([^\n]*\)\[\([^]]*\)\]\.\[\([^]]*\)\]\([^\n]*\)\n/{ s//\1\4\n\2.\3\n/; b again; }; s/[^\n]*\n//; s/\n$//; s/\n/, /g; /^$/!p'

score 0 · Answer 5 · answered Feb 09 '21 at 17:26


With GNU awk for FPAT (using regexp and gensub function from Ed Morton's code):

awk -v OFS=', ' -v FPAT='[[][^]]*][.][[][^]]*]' '{for (i=1; i<=NF; i++) printf "%s%s", gensub(/[][]/,"","g",$i), (i<NF?OFS:ORS)}' file
fox.jumps
the.quick, fox.jumps

answered Feb 09 '21 at 17:26 by Carlos Pascual
