I have a character string:

temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'

I need to get rid of the tabs, but only the ones between taxonomy terms.

I tried

re.sub(r'(?:\D{1})\t', ',', temp)

It came quite close, but it also replaced the letter before each tab:

'4424396.6\t1\tk__Bacteri,p__Firmicute,c__Erysipelotrich,o__Erysipelotrichales'

I am confused, because the re documentation for (?:...) says:

...the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

The last letter was within the parentheses, so how could it be replaced?

PS

I used re.sub(r'(?<=\D{1})(\t)', ',', temp) and it works perfectly fine, but I can't understand what's wrong with the first regexp.

lotrus28

2 Answers

The text matched by (?:...) does not form a capture group, unlike text matched by (...), and therefore cannot be referred to later with a backreference such as \1. However, it is still part of the overall match, and so it is part of the text that re.sub() replaces.
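To make this concrete, here is a small sketch using the string from the question. The variable names `m` and `fixed` are only illustrative. It shows that the letter before the tab is inside the overall match even though no group is created, and that one way to keep the letter is to capture it and put it back with \1:

```python
import re

temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'

# The non-capturing group still consumes the letter: it is part of group(0),
# but no numbered group is created for it.
m = re.search(r'(?:\D)\t', temp)
print(repr(m.group(0)))  # 'a\t'
print(m.groups())        # ()

# With a capturing group, the letter can be re-inserted via a backreference.
fixed = re.sub(r'(\D)\t', r'\1,', temp)
print(repr(fixed))
```

Since re.sub() replaces the whole match, anything you want to keep has to be either put back in the replacement or excluded from the match in the first place.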

The point of non-capturing groups is that they are slightly more efficient, and they may be required in uses such as re.split(), where the mere existence of capturing groups affects the output.
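A quick illustration of the re.split() difference, using a made-up sample string (`sample` is not from the question):

```python
import re

sample = 'a\tb\tc'  # hypothetical input, just for demonstration

# Non-capturing group: the separators are dropped from the result.
without_capture = re.split(r'(?:\t)', sample)
print(without_capture)  # ['a', 'b', 'c']

# Capturing group: the matched separators are included in the result.
with_capture = re.split(r'(\t)', sample)
print(with_capture)     # ['a', '\t', 'b', '\t', 'c']
```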

jasonharper

According to the documentation, (?:...) specifies a non-capturing group. It explains:

Sometimes you’ll want to use a group to collect a part of a regular expression, but aren’t interested in retrieving the group’s contents.

What this means is that anything that matches the ... expression (in your case, the character preceding the tab) will not be captured as a group but will still be part of the match. The only thing special about this is that you won't be able to access the part of the input matched by this group using match.group():

Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group

In contrast, (?<=...) is a positive lookbehind assertion; the regular expression will check that any match is preceded by text matching ..., but that text is not consumed and is not part of the match, so re.sub() leaves it untouched.
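As a rough sketch of the difference, again with the string from the question (the names `m` and `result` are only illustrative), the lookbehind version matches just the tab itself:

```python
import re

temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'

# The lookbehind only checks the preceding character; the match is the tab alone.
m = re.search(r'(?<=\D)\t', temp)
print(repr(m.group(0)))  # '\t'

# So only the tabs that follow a non-digit are replaced; the letters stay put.
result = re.sub(r'(?<=\D)\t', ',', temp)
print(repr(result))
```

Note that {1} in \D{1} is redundant, since \D already matches exactly one character.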

Tagc