I am trying to use the same regular expression pattern in Python and Bash, but it does not behave the same way when the pattern contains parentheses. For example, I want to find every "the" in an article. Here is the pattern:

"(^|[^a-zA-Z])[Tt]he($|[^a-zA-Z])" 

In Bash, I write

egrep --color "(^|[^a-zA-Z])[Tt]he($|[^a-zA-Z])" Untitled.txt 

But in Python, if I do this:

import re
pattern = re.compile(r"(^|[^a-zA-Z])[Tt]he($|[^a-zA-Z])")
pattern.findall(text)

I get back only what the parenthesized groups matched, whereas I want the "the" itself.

To get the same effect as in Bash, I have to repeat the word in every alternative:

pattern=re.compile(r"^[Tt]he|[^a-zA-Z][Tt]he|[Tt]he$|[Tt]he[^a-zA-Z]")

So my question is: do parentheses behave differently in Python and in Bash?

holdenweb
Pansa
  • This has nothing to do with bash. It's about python vs. egrep. – melpomene Aug 03 '19 at 15:22
  • Why so complicated? Wouldn't `grep '\'` just work? – melpomene Aug 03 '19 at 15:24
  • Arguably, the parentheses define capture groups in `grep` and Python. The difference is that `egrep` ignores the capture groups (unless options like `-o` or `--color` are used), while `findall` returns *only* the contents of the capture groups. – chepner Aug 03 '19 at 15:24
  • @melpomene I think `\b` would be better, since OP wanted to exclude only letters, not other symbols. – joanis Aug 03 '19 at 15:50
  • @joanis What do you think is the difference between `\` and `\b`? – melpomene Aug 03 '19 at 15:53
  • @Pansa, maybe you want to use capturing parentheses around the part you want to keep, with non-capturing ones outside: `(?:...)(...)(?:...)` will save the results of the middle set of parens, but not the two sets with `(?:...)`. So this would mean `(?:^|[^a-zA-Z])([Tt]he)(?:$|[^a-zA-Z])`. Or, much simpler, as melpomene said, `\'`. – joanis Aug 03 '19 at 15:54
  • @melpomene in my tests, I thought `grep '\bthe\b'` matched "the.", while `grep '\'` did not, but I just tested again and I must have misread the output, because they both match. My apologies. – joanis Aug 03 '19 at 15:55
  • The capturing groups themselves work the same, just the Python [`re.findall`](https://docs.python.org/2/library/re.html#re.findall) behaves in its own way: *If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.* – Wiktor Stribiżew Aug 03 '19 at 18:33
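
To make the comments above concrete, here is a minimal Python sketch of the `re.findall` behaviour they describe. The sample `text` and the variable names are made up for illustration. The lookaround pattern at the end is an extra alternative that nobody in the thread proposed, so treat it as one possible option rather than the recommended fix.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "The cat and the dog; other theme."

# The original pattern has two capturing groups, so findall returns
# a list of tuples holding the group contents, not the whole match.
capturing = re.compile(r"(^|[^a-zA-Z])[Tt]he($|[^a-zA-Z])")
groups_only = capturing.findall(text)
print(groups_only)  # [('', ' '), (' ', ' ')]

# finditer yields match objects; group(0) is the whole matched text,
# including the surrounding non-letter characters.
whole_matches = [m.group(0) for m in capturing.finditer(text)]
print(whole_matches)  # ['The ', ' the ']

# Non-capturing groups for the context and one capturing group for
# the word: findall now returns just the word.
one_group = re.compile(r"(?:^|[^a-zA-Z])([Tt]he)(?:$|[^a-zA-Z])")
words = one_group.findall(text)
print(words)  # ['The', 'the']

# Assumed alternative: lookarounds check the context without consuming it,
# so two occurrences separated by a single character are both found.
lookaround = re.compile(r"(?<![a-zA-Z])[Tt]he(?![a-zA-Z])")
print(lookaround.findall(text))       # ['The', 'the']
print(one_group.findall("the the"))   # ['the']
print(lookaround.findall("the the"))  # ['the', 'the']
```

The last two lines show why the choice can matter: the original pattern consumes the separator after the first "the", so the second one has no non-letter left in front of it, while the lookaround version finds both.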

0 Answers