AWK: Access captured group from line pattern

Question

If I have an awk command

pattern { ... }

and pattern uses a capturing group, how can I access the string so captured in the block?

http://stackoverflow.com/questions/1555173/gnu-awk-accessing-captured-groups-in-replacement-text — lt1776, Jan 12 '11 at 18:12
Sometimes (in simple cases) it's possible to adjust the field separator (`FS`) and pick what one would like to match with a `$field`. Preformatting the input could help too. — Krzysztof Jabłoński, Jul 01 '15 at 17:06
There is a [better answer](http://stackoverflow.com/a/10254791/894885) on the duplicate question. — Samuel Edwin Ward, Jul 08 '15 at 16:04
Samuel Edwin Ward: That's a nice answer too! But it also requires `gawk` (since it uses `gensub`). — rampion, Jul 08 '15 at 17:39

score 371 · Answer 1 · edited Apr 26 '17 at 14:39

With gawk, you can use the match function to capture parenthesized groups.

gawk 'match($0, pattern, ary) {print ary[1]}'

example:

echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}'

outputs cd.

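Note that the three-argument form of match(), which fills an array with the captured groups, is a gawk extension.

For a portable alternative you can achieve similar results with the two-argument match() and substr(). RSTART and RLENGTH describe the whole match, so the pattern has to be written so that the wanted text can be cut out by a fixed offset.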

example:

echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'

outputs cd.
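
Because RSTART and RLENGTH cover the whole match rather than a single group, the portable approach is easiest when the text around the group has a known length. A minimal sketch, assuming a hypothetical input such as "took 250ms" where only the digits are wanted (the function name extract_ms is made up for illustration):

```shell
# Hypothetical helper: print the digits that precede a trailing "ms".
# Uses only POSIX awk features (match, substr, RSTART, RLENGTH).
extract_ms() {
  awk 'match($0, /[0-9]+ms$/) {
    # The match covers the digits plus "ms", so drop the 2-character suffix.
    print substr($0, RSTART, RLENGTH - 2)
  }'
}

echo "took 250ms" | extract_ms
```

This prints 250. Lines without a trailing number followed by "ms" produce no output.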

edited Apr 26 '17 at 14:39 by Thor
answered Jan 12 '11 at 19:49 by glenn jackman

Yes, the gxxx variants have lots of additional GNU goodness and power. – Peter Tillemans Jun 23 '11 at 18:33
Works in BusyBox awk as well. – MrMas Apr 23 '20 at 14:59

Peter Tillemans · Accepted Answer · 2011-06-23T18:34:21.247

196

That was a stroll down memory lane...

I replaced awk by perl a long time ago.

Apparently the AWK regular expression engine does not capture its groups.

You might consider using something like:

perl -n -e'/test(\d+)/ && print $1'

The -n flag causes perl to loop over every input line, as awk does.

edited Jun 23 '11 at 18:34
answered Jun 02 '10 at 12:50 by Peter Tillemans

Apparently someone disagrees. This web page is from 2005 : http://www.tek-tips.com/faqs.cfm?fid=5674 It confirms that you cannot reuse matched groups in awk. – Peter Tillemans Jun 02 '10 at 13:00
[this article](http://www.catonmat.net/blog/ten-awk-tips-tricks-and-pitfalls/) seems to agree with you too. – rampion Jun 02 '10 at 13:10
As the tek-tips article states, gawk can re-use capture groups. – Dennis Williamson Jun 02 '10 at 14:00
I prefer 'perl -n -p -e...' over awk for almost all use cases, since it is more flexible, more powerful and has a saner syntax in my opinion. – Peter Tillemans Jun 23 '11 at 18:39
`gawk` != `awk`. They're different tools and `gawk` isn't available by default in most places. – Oli Sep 04 '12 at 12:21
Thanks for the syntax. `&&` and `;` made great differences!! – leesei May 21 '15 at 16:52
The OP specifically asked for an awk solution, so I don't think this is an answer. – Joppe Feb 22 '16 at 16:22
@Joppe you can't give an awk solution if there is no solution. In line 3 I explain that AWK does not support capturing groups and I gave an alternative, which the OP apparently appreciated because this answer was accepted. How could I better answer this question? – Peter Tillemans Mar 09 '16 at 07:54
@famousgarkin I keep forgetting Perl for the same reasons I still use grep and/or cut instead of awk: I build up long commands incrementally. And I sometimes have some vague idea that it matters that Perl is larger than grep/awk. – android.weasel Oct 20 '17 at 14:02

score 35 · Answer 3 · answered Dec 29 '12 at 20:32

This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.

Definition

Add this to your .bash_profile etc.

function regex { gawk 'match($0,/'"$1"'/, ary) {print ary['"${2:-0}"']}'; }

Usage

Print the whole match for each line in the file

$ cat filename | regex '.*'

Print the first capture group for each line in the file

$ cat filename | regex '(.*)' 1
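
The function above splices its argument into the awk program text, so a pattern containing a / or a quote can break the program. One possible alternative, sketched here on the assumption that gawk is installed, passes the pattern and the group number through the environment instead. The names regex_group, REGEX_PAT and REGEX_GRP are made up for illustration:

```shell
# Hypothetical variant of the helper above; requires gawk for 3-argument match().
# Usage: regex_group PATTERN [GROUP]   (GROUP defaults to 0, the whole match)
regex_group() {
  REGEX_PAT="$1" REGEX_GRP="${2:-0}" gawk '
    # ENVIRON values are not escape-processed, so \. reaches the regex intact.
    match($0, ENVIRON["REGEX_PAT"], ary) { print ary[ENVIRON["REGEX_GRP"]] }
  '
}
```

For example, echo 'took 250ms' | regex_group '([0-9]+)ms$' 1 should print 250 on a system where gawk is available.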

answered Dec 29 '12 at 20:32 by opsb

How is it different from using `grep -o`? – bfontaine Mar 28 '17 at 14:38
@bfontaine Could `grep -o` output captured groups? – Olle Härstedt Mar 07 '18 at 15:29
@OlleHärstedt No it couldn’t. It only covers your use-case when you don’t have capture-groups. In that case it gets ugly with chained `grep -o`'s. – bfontaine Mar 07 '18 at 17:16
this needs support for multiple captures – SgtPooki Feb 12 '21 at 23:13

score 18 · Answer 4 · answered Nov 28 '12 at 03:51

You can use GNU awk:

$ cat hta
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]

$ gawk 'match($0, /.*(http.*?)\$/, m) { print m[1]; }' < hta
http://www.mysite.net/
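
The `?` after `.*` has no lazy meaning in awk's extended regular expressions, and the three-argument match() needs gawk. A portable sketch for this particular input, assuming the URL contains no `$` before the one being matched (the function name rewrite_target is hypothetical):

```shell
# Hypothetical helper: print the URL prefix that precedes a literal "$".
# POSIX awk only: match the URL plus the "$", then trim the final character.
rewrite_target() {
  awk 'match($0, /http[^$]*\$/) { print substr($0, RSTART, RLENGTH - 1) }'
}

printf '%s\n' 'RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]' | rewrite_target
```

This prints http://www.mysite.net/ for the RewriteRule line and nothing for lines that do not contain a lowercase http followed later by a `$`.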

answered Nov 28 '12 at 03:51 by Isvara

That's [what glenn jackman's answer says](http://stackoverflow.com/a/4673336/9859), pretty much. – rampion Nov 29 '12 at 13:02
Ed Morton: that deserves a top-level answer I'd say. edit: uhm... that prints `RewriteRule (.*) http://www.mysite.net/$` for me, which is more than the subgroup. – rampion Nov 29 '12 at 13:02
[Looks like `RSTART` and `RLENGTH` refer to the substring matched by the pattern](http://www.grymoire.com/Unix/Awk.html#uh-47) – rampion Nov 29 '12 at 13:10
@EdMorton - no, that will select the whole line that contains `http...` pattern – KFL Dec 24 '20 at 06:54
@KFL you're right but actually there's a worse problem that the posted answer (and my suggestion to make it not gawk-specific) both contain `.*?` which is a PCRE-ism and undefined behavior in an ERE. I'll delete my comment. – Ed Morton Dec 24 '20 at 13:41

score 6 · Answer 5 · answered Mar 21 '12 at 01:58

You can simulate capturing in vanilla awk too, without extensions. It's not intuitive though:

Step 1. Use gensub to surround matches with some character that doesn't appear in your string.
Step 2. Use split against the character.
Step 3. Every other element in the split array is your capture group.

$ echo 'ab cb ad' | awk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
ab|ad

answered Mar 21 '12 at 01:58 by ydrol

I'm almost certain that `gensub` is a `gawk` specific function. What do you get from your awk if you type `awk --version` ;-?). Good luck to all. – shellter Apr 13 '12 at 05:28
I'm fully certain that gensub is a gawk-ism, though BusyBox awk also has it. This answer could also be implemented using gsub, though: `echo 'ab cb ad' | awk '{gsub(/a./,SUBSEP"&"SUBSEP);split($0,cap,SUBSEP);print cap[2]"|"cap[4]}'` – dubiousjim Apr 19 '12 at 01:05
gensub() is a gawk extension, gawk's manual clearly say so. Other awk variants may also implement it, but it is still not POSIX. Try gawk --posix '{gsub(...)}' and it will complain – MestreLion Apr 21 '12 at 05:19
@MestreLion, you mean it will complain for `gawk --posix '{gensub(...)}'`. – dubiousjim Apr 24 '12 at 00:08
@dubiousjim: oops, yes, `gensub()`, sorry for the typo – MestreLion Apr 24 '12 at 02:25
Apart from being wrong about **POSIX awk** having the `gensub` function, your example only applies to a very limited scenario: the whole pattern is the group, so it can't handle something like `key=(value)` when I want to extract only the `value` parts. – Meow Sep 24 '15 at 13:24
Enough people have commented about "gensub is a gawk-ism". Why not edit your answer at least? – Juan May 26 '18 at 01:23

score 2 · Answer 6 · answered Aug 03 '16 at 19:16

I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:

function regex { perl -n -e "/$1/ && printf \"%s\n\", "'$1'; }

I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.

'([0-9]*)ms$'
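
To see what the quoting in the function does, it can help to print the script that perl would receive instead of running it. A small sketch (the function name show_perl_script is hypothetical, and printf is used here only to display the string):

```shell
# Hypothetical helper: build the same string the regex function hands to perl -e,
# but print it instead of executing it.
show_perl_script() {
  # Double-quoted part: the shell expands $1 to the function argument.
  # Single-quoted part: $1 is left alone for perl to treat as capture group 1.
  printf '%s\n' "/$1/ && printf \"%s\n\", "'$1'
}

show_perl_script '([0-9]*)ms$'
```

This prints /([0-9]*)ms$/ && printf "%s\n", $1, which shows that the shell replaced the first $1 with the pattern while the second $1 survives for perl.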

answered Aug 03 '16 at 19:16 by wytten

I prefer this solution, since you can see the parts of the group that delimit the capture, while also omitting them. However, could someone explain how this works? I can't get this perl syntax to work properly in BASH, because I don't understand it very well - especially the double/single-quote marks around `$1` – Demis Dec 19 '17 at 18:39
It is not something I have done before or since, but looking back what it is doing is concatenating two strings, the first string being in double quotes (this first string contains embedded double quotes escaped with backslash) and the second string being in single quotes. Then the result of that concatenation is supplied as argument to perl -e. Also you need to know that the first $1 (the one within double quotes) is substituted with the first argument to the function, while the second $1 (the one within single quotes) is left untouched. See [this example](https://i.imgur.com/Bfp2TmA.png) – wytten Dec 19 '17 at 23:01
I see, that's making a bit more sense now. So where in the perl command is the regex match/group capture definition? I see you wrote `'([0-9]*)ms$'` - is that supplied as an argument (and the string another argument)? And the output from `perl -e` is being inserted into bash's `printf` command then, to replace `%s`, is that right? Thanks, I am hoping to use this. – Demis Dec 20 '17 at 23:55
You pass a regular expression enclosed in single quotes as the sole argument to the regex bash function. [Example](https://i.imgur.com/71UKj52.png) – wytten Dec 21 '17 at 13:51
