1

I get what it does at a high level: it detects duplicate words. What I am having trouble understanding is the logic of how it works. I am hoping you will correct me if my understanding is off. One other detail: assume that I am using grep on a Linux machine.

  1. \b will detect the first character.
  2. \w+ will scan the letters.

Now for the part I am confused about.

  1. The brackets will "store" the letters up to the first space or really the second \b
  2. Then the \1 will repeat steps 1 to 3, compare the results, and if they match... display.

I would appreciate layman's terms if possible.

user69001

3 Answers

3

(\b\w+) \1\b detects repeated words. For example, abc abc or aaa aaa or x123_ x123_.
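To make the examples above concrete, here is a minimal sketch in Python rather than grep. It assumes Python's `re` module treats `\b`, `\w` and `\1` the same way for these ASCII inputs; the helper name `lines_with_duplicates` is made up for illustration.

```python
import re

# Same pattern as in the answer; the raw string keeps the backslashes intact.
DUPLICATE_WORD = re.compile(r"(\b\w+) \1\b")

def lines_with_duplicates(lines):
    """Return the lines that contain a repeated word, roughly like grep would."""
    return [line for line in lines if DUPLICATE_WORD.search(line)]

sample = ["abc abc", "aaa aaa", "x123_ x123_", "abc abd", "no repeats here"]
print(lines_with_duplicates(sample))
# prints ['abc abc', 'aaa aaa', 'x123_ x123_']
```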

A word is a sequence of word characters, as defined below.

A word character, depending on the mode (ASCII, locale, or Unicode), is a letter, a digit, or an underscore; which letters and digits count can depend on the locale.

\b matches a word boundary: a position that has a word character on one side (before or after) but not on both.
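Since \b matches a position rather than a character, it can help to list where those positions are. This is a small sketch using Python's `re` module (an assumption, since the question is about grep); `boundary_positions` is a hypothetical helper.

```python
import re

def boundary_positions(text):
    """Return every index in text where \\b matches (zero-width positions)."""
    return [m.start() for m in re.finditer(r"\b", text)]

# Boundaries sit before 'a', after 'b', before 'c', and after 'd'.
print(boundary_positions("ab cd"))  # prints [0, 2, 3, 5]
```

Note that no boundary is reported between `a` and `b`, because both neighbours are word characters there.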

There is a slight flaw in the regex above. If a word is repeated three or more times, replacing each match with capturing group 1 removes only about half of the repeats, because each match consumes two copies of the word and matches cannot overlap.
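The flaw with three or more repeats can be reproduced with a short sketch in Python's `re` module. The second pattern is not from the answer; it is one possible variant, offered as an assumption, that lets the backreference repeat so a whole run collapses in one match.

```python
import re

text = "the the the cat"

# Original pattern: each match consumes exactly two copies of the word.
pairwise = re.sub(r"(\b\w+) \1\b", r"\1", text)
print(pairwise)   # prints: the the cat

# Hypothetical variant: allow " word" to repeat one or more times.
collapsed = re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text)
print(collapsed)  # prints: the cat
```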

nhahtdh
1

Pattern explanation:

  (                        group and capture to \1:
    \b                       the word boundary
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or more times)
  )                        end of \1
                           ' ' (a single literal space)
  \1                       what was matched by capture \1
  \b                       the word boundary

\w matches a-z, A-Z, 0-9 and _. Note that the leading \b is still needed even though \w+ follows it: without it, the group could start in the middle of a word, so (\w+) \1\b would also match the "is is" inside "this is".
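A quick way to check what the leading \b contributes is to run the pattern with and without it on the same input. This sketch uses Python's `re` module as a stand-in for grep; the variable names are illustrative only.

```python
import re

with_boundary = re.compile(r"(\b\w+) \1\b")
without_boundary = re.compile(r"(\w+) \1\b")

text = "this is fine"

# With the leading \b the group must start at the beginning of a word.
print(with_boundary.search(text))   # prints: None

# Without it, the group can start inside "this" and capture just "is".
m = without_boundary.search(text)
print(m.group(0))                   # prints: is is
```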

\1 is a backreference: it matches the same text that was captured by the first group.

Here the parentheses (...) are used to create the group.

                  (\b\w+) \1\b
First Group ------^^^^^^^ ^^-------- Match First Group again
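The diagram can be checked by inspecting what the group captured after a match. This is a sketch in Python's `re` module, assumed here to behave like grep for this pattern; the sample sentence is made up.

```python
import re

m = re.search(r"(\b\w+) \1\b", "say hello hello world")

# group(1) is what the parentheses captured; \1 then had to match it again.
print(m.group(1))  # prints: hello
# group(0) is the whole match: captured word, space, repeated word.
print(m.group(0))  # prints: hello hello
print(m.span())    # prints: (4, 15)
```

The engine first tries "say", finds that the text after the space is not "say" again, and moves on until the captured word and the text after the space agree.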


Braj
0

\1 is a backreference, which means it matches the same text that the first capture group matched. In this case, (\b\w+) is the capture group, so \1 matches whatever word that group just captured.

More on the backreference can be found here http://www.regular-expressions.info/backref.html

hysh_00