Not sure if I understand the regex: (\b\w+) \1\b?

Question

I get what it does at a high level: it detects duplicate words. What I am having trouble understanding is the logic of how it works. I am hoping you will correct me if my understanding is off. One other detail: assume that I am using grep on a Linux machine.

\b will detect the first character.
\w+ will scan the letters.

Now for the part I am confused about.

The brackets will "store" the letters up to the first space, or really the second \b,
then the \1 will repeat steps 1 to 3 and then compare, and if they match... display.

I would appreciate layman's terms if possible.

nhahtdh · Accepted Answer · 2014-10-10T04:11:04.190

3

(\b\w+) \1\b detects repeated words. For example, abc abc or aaa aaa or x123_ x123_.
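To see those examples in action, here is a minimal sketch. It assumes Python's re module as a stand-in for grep (the two may differ in details), and the helper name find_duplicates is made up for illustration.

```python
import re

# The pattern from the question: a word, one space, then the same word again.
DUPLICATE = re.compile(r"(\b\w+) \1\b")

def find_duplicates(text):
    """Return the full text of every duplicated-word match in `text`."""
    return [m.group(0) for m in DUPLICATE.finditer(text)]

print(find_duplicates("abc abc"))
print(find_duplicates("x123_ x123_ and aaa aaa"))
print(find_duplicates("abc abd"))  # second word differs, so nothing is found
```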

A word is a sequence of word characters, as defined below.

A word character, depending on the mode (ASCII, Locale or Unicode), is a letter (can be locale dependent), a digit (can be locale dependent) or an underscore. \w matches one such character.
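The effect of the mode can be seen with a short sketch. This assumes Python's re module, where \w is Unicode-aware by default for str patterns and re.ASCII restricts it to a-z, A-Z, 0-9 and _; grep's behaviour depends on the locale instead. The sample string is made up.

```python
import re

# "naïve_1 café", written with escapes so the accented letters are unambiguous.
sample = "na\u00efve_1 caf\u00e9"

# Unicode mode (the default for str patterns): accented letters count as word characters.
unicode_words = re.findall(r"\w+", sample)

# ASCII mode: only a-z, A-Z, 0-9 and _ count, so accented letters split the words.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```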

\b detects a word boundary, which is a position that has a word character immediately before it or immediately after it, but not both.
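Since \b matches a position rather than a character, it can help to list where those positions are. A small sketch, assuming Python's re module; boundary_positions is a hypothetical helper name.

```python
import re

def boundary_positions(text):
    """Return the indexes in `text` where \\b matches (zero-width positions)."""
    return [m.start() for m in re.finditer(r"\b", text)]

# "ab cd": boundaries sit before 'a', after 'b', before 'c' and after 'd'.
print(boundary_positions("ab cd"))

# Underscore and digits are word characters, so "a_1" is one word.
print(boundary_positions("a_1"))

# No word characters at all means no boundaries.
print(boundary_positions("!!"))
```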

There is a slight flaw in the regex above. If the word is repeated 3 or more times in a row, replacing each match with capturing group 1 removes only about half of the repeats, because each match consumes two copies of the word. For example, "a a a a" becomes "a a".
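The flaw is easy to reproduce. The sketch below assumes Python's re module; the second pattern, which lets the repeated part occur one or more times, is only one possible workaround and is not part of the original regex. Both function names are made up.

```python
import re

def dedupe_pairs(text):
    """Replace each match of the original pattern with the captured word."""
    return re.sub(r"(\b\w+) \1\b", r"\1", text)

def dedupe_runs(text):
    """Hypothetical variant: collapse a whole run of repeats in one match."""
    return re.sub(r"(\b\w+)(?: \1\b)+", r"\1", text)

print(dedupe_pairs("the the the cat"))  # one repeat survives
print(dedupe_pairs("a a a a"))          # half of the copies survive
print(dedupe_runs("the the the cat"))
print(dedupe_runs("a a a a"))
```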

edited Oct 10 '14 at 04:11

answered Oct 10 '14 at 04:04

nhahtdh


consider adding space before the second part. – Avinash Raj Oct 10 '14 at 04:05
OK got it. The space was not very visible in the title. – nhahtdh Oct 10 '14 at 04:08
+1 for nice explanation. – Braj Oct 10 '14 at 04:19

Braj · Answer 2 · 2014-10-10T04:17:50.197

1

Pattern explanation:

  (                        group and capture to \1:
    \b                       the word boundary
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or more times)
  )                        end of \1
                           ' '
  \1                       what was matched by capture \1
  \b                       the word boundary
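The same breakdown can be written as a commented pattern. This is a sketch assuming Python's re.VERBOSE flag (grep has no equivalent); in that mode plain whitespace in the pattern is ignored, so the literal space is written as [ ]. The name DUPLICATE_VERBOSE is made up.

```python
import re

DUPLICATE_VERBOSE = re.compile(
    r"""
    (        # start of group 1
      \b     #   word boundary
      \w+    #   one or more word characters
    )        # end of group 1
    [ ]      # a literal space (kept because it is inside a character class)
    \1       # the exact text that group 1 captured
    \b       # word boundary
    """,
    re.VERBOSE,
)

match = DUPLICATE_VERBOSE.search("say it it again")
print(match.group(0))  # the whole match
print(match.group(1))  # what group 1 captured
```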

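\w matches a-z, A-Z, 0-9 and _. The first \b is still needed, though: without it the group can start in the middle of a word, so "this is" would match through "is is". The second \b stops the repeated word from being only the start of a longer word, as in "the then".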
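Whether the leading \b changes anything can be checked by running the pattern with and without it. A minimal sketch, assuming Python's re module (grep may differ in details); the sample sentence and variable names are made up.

```python
import re

with_leading_b = re.compile(r"(\b\w+) \1\b")
without_leading_b = re.compile(r"(\w+) \1\b")

text = "this is fine"

# With the leading \b the group must start at the beginning of a word.
print(with_leading_b.search(text))

# Without it the group may start mid-word: the "is" inside "this" pairs with "is".
print(without_leading_b.search(text).group(0))
```

So in this sketch the leading \b does make a difference: it rules out matches that begin inside a longer word.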

\1 is a backreference: it matches the same text that the first group captured.

Here the parentheses (...) are used to create a group.

                  (\b\w+) \1\b
First Group ------^^^^^^^ ^^-------- Match First Group again
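One point worth spelling out is that \1 re-matches the captured text, not the pattern inside the group. A sketch comparing the two, assuming Python's re module; the second pattern is a made-up contrast case, not part of the original regex.

```python
import re

# Backreference: the second word must be the same text as the first.
backref = re.compile(r"(\b\w+) \1\b")

# Contrast case: repeating the sub-pattern accepts any second word.
any_word = re.compile(r"(\b\w+) \w+\b")

for text in ("hello hello", "hello world"):
    print(text, bool(backref.search(text)), bool(any_word.search(text)))
```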


edited Oct 10 '14 at 04:17

answered Oct 10 '14 at 04:10

Braj


`\b` is not needed if `\w` is already used. – Braj Oct 10 '14 at 04:15
If we are interested in words then yes second `\b` is needed. – Braj Oct 10 '14 at 04:16

score 0 · Answer 3 · answered Oct 10 '14 at 04:03


\1 is a backreference, which means it matches the same text as the first capture group. In this case, (\b\w+) is the first (and only) capture group, so \1 matches whatever that group captured.

More on backreferences can be found here: http://www.regular-expressions.info/backref.html
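To make the numbering concrete, a pattern with two groups shows that \1 refers to group number 1 and \2 to group number 2, regardless of which group matched most recently. A sketch assuming Python's re module; the patterns and inputs are made up for illustration.

```python
import re

first_again = re.compile(r"(\w)(\w)\1")   # third character must equal group 1
second_again = re.compile(r"(\w)(\w)\2")  # third character must equal group 2

for text in ("aba", "abb"):
    print(text, bool(first_again.search(text)), bool(second_again.search(text)))
```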


hysh_00


*\1 matches the last captured group* Is it correct? – Braj Oct 10 '14 at 04:21
I mean the 1st captured group. In this case, the last word. – hysh_00 Oct 10 '14 at 04:29
