
I'm trying to understand a regular expression someone has written in the gsub() function.

I've never used regular expressions before seeing this code, and I have tried to work out how it gets the final result with some googling, but I have hit a wall, so to speak.

gsub('.*(.{2}$)', '\\1',"my big fluffy cat")

This code returns the last two characters of the given string; in the example above it returns "at". This is the expected result, but from my brief foray into regular expressions I don't understand why the code does what it does.

What I understand is that '.*' means match any character zero or more times. So it's going to look at the entire string, and this is what will be replaced.

The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets was in place of the '\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.

All that does, though, is output the pattern text itself as the replacement, e.g. ".{2}$".

Finally, I don't understand why '\1' is in the replacement part of the function. To me this is just saying replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is just there to make the second backslash a non-special character.

Steve

3 Answers


Hopefully these examples help you understand it better:

Say we have a string foobarabcabcdef

  • .* matches the whole string.

  • .*abc matches from the beginning of the string through the last abc (greedy matching), so it matches foobarabcabc.

  • .*(...)$ matches the whole string as well, but the last 3 characters are grouped. Even without any (), the whole match forms a default group, group 0; each () then adds group 1, 2, 3, and so on. Consider .*(...)(...)(...)$, for which we have:

    group 0 : whole string
    group 1 : "abc" the first "abc"
    group 2 : "abc" the 2nd "abc"
    group 3 : "def" the last 3 chars
    

So back to your example: \\1 is a reference to group 1. What it does is "replace the whole match with the text captured in group 1". That is, the text matched by .{2}$ is the replacement.

If you don't understand the backslashes, you will have to consult the string syntax of R; I cannot say much more than that it is all about escaping.
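As a rough illustration of the escaping: in an R string literal the backslash is itself an escape character, so the source text "\\1" produces a two-character string, a backslash followed by 1, and it is that two-character string that gsub() reads as a reference to group 1. The sketch below assumes only base R; `repl` and `last_two` are illustrative names.

```r
repl <- "\\1"

# The string holds two characters, not three.
print(nchar(repl))
# [1] 2

# writeLines() shows the raw contents: \1
writeLines(repl)

# Passed as the replacement, those two characters mean "group 1".
last_two <- gsub(".*(.{2}$)", repl, "my big fluffy cat")
print(last_two)
# [1] "at"
```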

Kent

The important part of that regular expression is the brackets, which form what is called a "capturing group".

The regular expression .*(.{2}$) says: match anything, and capture the last 2 characters of the line. The replacement \\1 refers to that group, so the whole match is replaced with the captured group, which is the last two characters in this case.
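A small sketch may make the two roles clearer: the replacement argument is ordinary text apart from backreferences such as \\1, and only the matched portion of the string is replaced. The variable names here are illustrative, not part of the original question.

```r
x <- "my big fluffy cat"

# A pattern written in the replacement is not interpreted as a regex;
# it is inserted literally.
literal <- gsub(".*(.{2}$)", ".{2}$", x)
print(literal)
# [1] ".{2}$"

# Without the leading .*, only the last two characters are matched,
# so only they are replaced (here: wrapped in square brackets).
wrapped <- gsub("(.{2}$)", "[\\1]", x)
print(wrapped)
# [1] "my big fluffy c[at]"

# With .* in front, the match covers the whole string, so the whole
# string is replaced by the captured group.
last_two <- gsub(".*(.{2}$)", "\\1", x)
print(last_two)
# [1] "at"
```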

Raffi

There are two ways of using gsub. The most common way is probably:

gsub("-","TEST","This is a - ")

which would return

This is a TEST

This simply finds each match of the regular expression and replaces it with the replacement string.

The second way to use gsub is the one you described, using \\1, \\2 or \\3...

These refer to the first, second or third capture group in your regular expression.

A capture group is anything inside parentheses, e.g. (capture_group_1)(capture_group_2)...
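For instance, two capture groups can be swapped in the replacement. This is only an illustrative sketch; the input strings and variable names are made up for the example.

```r
# Two groups of lowercase letters separated by a space, swapped.
swapped <- gsub("([a-z]+) ([a-z]+)", "\\2 \\1", "hello world")
print(swapped)
# [1] "world hello"

# gsub() replaces every match, so each pair of words is swapped in turn.
pairs <- gsub("([a-z]+) ([a-z]+)", "\\2 \\1", "one two three four")
print(pairs)
# [1] "two one four three"
```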

Explanation

Your analysis is correct.

What I understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.

The part in brackets looks for any two characters at the end of the string

The last two characters are placed in a capture group, and we simply replace the whole string with the contents of this capture group. The captured characters themselves are not replaced with anything.

If it helps, check out the result of this expression:

gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
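Running that expression, and breaking the match into its groups, might look like the sketch below. The use of `regexec()` and `regmatches()` to list the groups is an addition for illustration; `res` and `parts` are hypothetical names.

```r
x <- "my big fluffy cat"

res <- gsub("(.*)(.{2}$)", "Group 1: \\1, Group 2: \\2", x)
print(res)
# [1] "Group 1: my big fluffy c, Group 2: at"

# The two groups together cover the whole string, so the whole string
# is what gets replaced by the template above.
parts <- regmatches(x, regexec("(.*)(.{2}$)", x))[[1]]
print(parts)
# [1] "my big fluffy cat" "my big fluffy c"   "at"
```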
Sada93
  • Thanks, that makes much more sense now, although I'm not quite sure about this. Your code outputs "Group 1: my big fluffy c, Group 2: at". If group 1 is 'my big fluffy c', then I would have thought the output would be "Group 1: my big fluffy c, Group 2: atat", because it's surely only replacing the 'my big fluffy c' part of the pattern. How come it's replacing the whole pattern if group 1 only covers the string before the final two characters? – Steve Feb 18 '19 at 20:06
  • I guess the word "replace" is confusing. When using `\\1` I like to think of it as "extracting" the text of that capture group. It literally extracts the first capture group and the second capture group and puts them in place of `\\1` and `\\2`. – Sada93 Feb 18 '19 at 20:17
  • code1: `gsub('(W[a-e])', 'rn', "Western")` code2: `gsub('(W[a-e])(.{2}$)', '\\2', "Western")` I appear to have confused myself more over this. I would have thought these two lines of code would both output "rnstern", but the second line returns "Western". Group 1 matches "We" in the string, and I expected it to be replaced by group 2, which is "rn". As that isn't happening, I don't understand what is going on. – Steve Feb 19 '19 at 20:43
  • @Steve When **no** match is found, gsub returns the original string, which is what is happening with `code2`. This will give you rn: `gsub('(W[a-e]).*(.{2}$)', '\\2', "Western")`. Code1 matches `We` in Western and replaces it with `rn`, which gives you `rnstern`. – Sada93 Feb 19 '19 at 20:53
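The three calls from this comment thread can be compared side by side. This is a sketch using the same patterns as the comments; the `grepl()` check and the variable names are additions for illustration.

```r
x <- "Western"

# code1: "We" is matched and replaced by the literal text "rn".
code1 <- gsub("(W[a-e])", "rn", x)
print(code1)
# [1] "rnstern"

# code2: the pattern needs "We" to be followed immediately by the last
# two characters of the string, which is not the case here.
matched <- grepl("(W[a-e])(.{2}$)", x)
print(matched)
# [1] FALSE

# With no match, gsub() returns the input unchanged.
code2 <- gsub("(W[a-e])(.{2}$)", "\\2", x)
print(code2)
# [1] "Western"

# Adding .* lets the pattern span the characters in between, so the
# whole string matches and is replaced by group 2.
fixed <- gsub("(W[a-e]).*(.{2}$)", "\\2", x)
print(fixed)
# [1] "rn"
```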