255

If I have an awk command

pattern { ... }

and pattern uses a capturing group, how can I access the string so captured in the block?

ira wati
rampion

6 Answers

371

With gawk, you can use the match function to capture parenthesized groups.

gawk 'match($0, pattern, ary) {print ary[1]}' 

example:

echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}' 

outputs cd.

Note that this requires gawk specifically: the three-argument form of match(), which fills an array with the captured groups, is a gawk extension.
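
With more than one group, each one gets its own array slot. A minimal sketch, assuming gawk is installed; split_kv and the sample input are made up for illustration:

```shell
# Requires GNU awk: the three-argument match() is a gawk extension.
# split_kv is a hypothetical helper name.
split_kv() {
    gawk 'match($0, /([a-z]+)=([0-9]+)/, m) {
        # m[0] is the whole match; m[1] and m[2] are the two groups.
        print m[0] ": key=" m[1] ", value=" m[2]
    }'
}

# Only run the demo where gawk is actually available.
if command -v gawk >/dev/null 2>&1; then
    echo "set width=80 now" | split_kv
fi
```

Lines that do not match the pattern print nothing, because match() returns 0 and the block is skipped.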

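For a portable alternative, you can achieve similar results with match() and substr(). The two-argument match() sets RSTART and RLENGTH to the position and length of the whole match, and you trim the delimiters by hand.

Example: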

echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'

outputs cd.
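
The same trick works when the part to drop is a prefix of known length. A minimal sketch, assuming input that contains a literal `name=` followed by lowercase letters; extract_name is a made-up helper name:

```shell
# Portable (POSIX awk) capture of the text after a fixed prefix.
# extract_name is a hypothetical helper used only for illustration.
extract_name() {
    awk 'match($0, /name=[a-z]+/) {
        # RSTART and RLENGTH describe the whole match, so skip the
        # 5-character prefix "name=" by hand.
        print substr($0, RSTART + 5, RLENGTH - 5)
    }'
}

echo "id=42 name=foo size=7" | extract_name
```

This only works because the prefix length is fixed; a variable-length prefix needs a second match() or the gawk array form.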

Thor
glenn jackman
196

That was a stroll down memory lane...

I replaced awk by perl a long time ago.

Apparently the AWK regular expression engine does not capture its groups.

You might consider using something like:

perl -n -e'/test(\d+)/ && print $1'

The -n flag causes perl to loop over every input line, like awk does.

Peter Tillemans
  • 4
    Apparently someone disagrees. This web page is from 2005 : http://www.tek-tips.com/faqs.cfm?fid=5674 It confirms that you cannot reuse matched groups in awk. – Peter Tillemans Jun 02 '10 at 13:00
  • [this article](http://www.catonmat.net/blog/ten-awk-tips-tricks-and-pitfalls/) seems to agree with you too. – rampion Jun 02 '10 at 13:10
  • 1
    As the tek-tips article states, gawk can re-use capture groups. – Dennis Williamson Jun 02 '10 at 14:00
  • 5
    I prefer 'perl -n -p -e...' over awk for almost all use cases, since it is more flexible, more powerful and has a saner syntax in my opinion. – Peter Tillemans Jun 23 '11 at 18:39
  • 16
    `gawk` != `awk`. They're different tools and `gawk` isn't available by default in most places. – Oli Sep 04 '12 at 12:21
  • Thanks for the syntax. `&&` and `;` made great differences!! – leesei May 21 '15 at 16:52
  • 7
    The OP specifically asked for an awk solution, so I don't think this is an answer. – Joppe Feb 22 '16 at 16:22
  • 10
    @Joppe you can't give an awk solution if there is no solution. In line 3 I explain that AWK does not support capturing groups and I gave an alternative, which the OP apparently appreciated because this answer was accepted. How could I better answer this question? – Peter Tillemans Mar 09 '16 at 07:54
  • @famousgarkin I keep forgetting Perl for the same reasons I still use grep and/or cut instead of awk: I build up long commands incrementally. And I sometimes have some vague idea that it matters that Perl is larger than grep/awk. – android.weasel Oct 20 '17 at 14:02
35

This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.

Definition

Add this to your .bash_profile or whichever startup file your shell reads.

function regex { gawk 'match($0,/'"$1"'/, ary) {print ary['"${2:-0}"']}'; }

Usage

Print the whole match for each line in the file (index 0 is the default)

$ cat filename | regex '.*'

Print the 1st capture group for each line in the file

$ cat filename | regex '(.*)' 1
opsb
18

You can use GNU awk:

$ cat hta
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]

$ gawk 'match($0, /.*(http.*?)\$/, m) { print m[1]; }' < hta
http://www.mysite.net/
Isvara
  • 5
    That's [what glenn jackman's answer says](http://stackoverflow.com/a/4673336/9859), pretty much. – rampion Nov 29 '12 at 13:02
  • 1
    Ed Morton: that deserves a top-level answer I'd say. edit: uhm... that prints `RewriteRule (.*) http://www.mysite.net/$` for me, which is more than the subgroup. – rampion Nov 29 '12 at 13:02
  • 3
    [Looks like `RSTART` and `RLENGTH` refer to the substring matched by the pattern](http://www.grymoire.com/Unix/Awk.html#uh-47) – rampion Nov 29 '12 at 13:10
  • @EdMorton - no, that will select the whole line that contains `http...` pattern – KFL Dec 24 '20 at 06:54
  • @KFL you're right but actually there's a worse problem that the posted answer (and my suggestion to make it not gawk-specific) both contain `.*?` which is a PCRE-ism and undefined behavior in an ERE. I'll delete my comment. – Ed Morton Dec 24 '20 at 13:41
6

You can simulate capturing in vanilla awk too, without extensions. It's not intuitive though:

Step 1. Use gensub to surround matches with some character that doesn't appear in your string.
Step 2. Use split against that character.
Step 3. Every other element of the resulting array is a capture group.

$ echo 'ab cb ad' | awk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
ab|ad
ydrol
  • 3
    I'm almost certain that `gensub` is a `gawk` specific function. What do you get from your awk if you type `awk --version` ;-?). Good luck to all. – shellter Apr 13 '12 at 05:28
  • 6
    I'm fully certain that gensub is a gawk-ism, though BusyBox awk also has it. This answer could also be implemented using gsub, though: `echo 'ab cb ad' | awk '{gsub(/a./,SUBSEP"&"SUBSEP);split($0,cap,SUBSEP);print cap[2]"|"cap[4]}'` – dubiousjim Apr 19 '12 at 01:05
  • 3
    gensub() is a gawk extension, gawk's manual clearly says so. Other awk variants may also implement it, but it is still not POSIX. Try gawk --posix '{gsub(...)}' and it will complain – MestreLion Apr 21 '12 at 05:19
  • 2
    @MestreLion, you mean it will complain for `gawk --posix '{gensub(...)}'`. – dubiousjim Apr 24 '12 at 00:08
  • @dubiousjim: oops, yes, `gensub()`, sorry for the typo – MestreLion Apr 24 '12 at 02:25
  • 1
    Besides being wrong about **POSIX awk** having the `gensub` function, your example applies only to a very limited scenario: the whole pattern is grouped, so it can't handle something like `key=(value)` when I want to extract only the `value` parts. – Meow Sep 24 '15 at 13:24
  • 1
    Enough people have commented about "gensub is a gawk-ism". Why not edit your answer at least? – Juan May 26 '18 at 01:23
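
For the `key=(value)` case raised in the comments, one option that stays within POSIX awk is to loop with match() and substr(), consuming the line piece by piece. A minimal sketch, assuming lowercase keys and numeric values; all_values is a made-up helper name:

```shell
# POSIX awk only: walk the line with match(), trimming the consumed part each time.
# all_values is a hypothetical helper; the key=value shape is an assumed input format.
all_values() {
    awk '{
        rest = $0
        while (match(rest, /[a-z]+=[0-9]+/)) {
            pair = substr(rest, RSTART, RLENGTH)
            # Keep only what follows the "=" inside the matched pair.
            print substr(pair, index(pair, "=") + 1)
            # Continue searching after the end of this match.
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

echo "a=1 skip bb=22 c=333" | all_values
```

Each value is printed on its own line, and words without an `=` are skipped.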
2

I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:

function regex { perl -n -e "/$1/ && printf \"%s\n\", "'$1'; }

I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.

'([0-9]*)ms$'
wytten
  • I prefer this solution, since you can see the parts of the group that delimit the capture, while also omitting them. However, could someone explain how this works? I can't get this perl syntax to work properly in BASH, because I don't understand it very well - especially the double/single-quote marks around `$1` – Demis Dec 19 '17 at 18:39
  • It is not something I have done before or since, but looking back what it is doing is concatenating two strings, the first string being in double quotes (this first string contains embedded double quotes escaped with backslash) and the second string being in single quotes. Then the result of that concatenation is supplied as argument to perl -e. Also you need to know that the first $1 (the one within double quotes) is substituted with the first argument to the function, while the second $1 (the one within single quotes) is left untouched. See [this example](https://i.imgur.com/Bfp2TmA.png) – wytten Dec 19 '17 at 23:01
  • I see, that's making a bit more sense now. So where in the perl command is the regex match/group capture definition? I see you wrote `'([0-9]*)ms$'` - is that supplied as an argument (and the string another argument)? And the output from `perl -e` is being inserted into bash's `printf` command then, to replace `%s`, is that right? Thanks, I am hoping to use this. – Demis Dec 20 '17 at 23:55
  • 1
    You pass a regular expression enclosed in single quotes as the sole argument to the regex bash function. [Example](https://i.imgur.com/71UKj52.png) – wytten Dec 21 '17 at 13:51