I have looked at lots of posts with similar titles, but I have found nothing that works with Python or even on this site: https://regex101.com

How can I match everything but a specific text?

My text:

1234_This is a text Word AB

Protocol  Address          ping
Internet  1.1.1.1            - 
Internet  1.1.1.2            25 
Internet  1.1.1.3            8 
Internet  1.1.1.4            - 

1234_This is a text Word BCD    
Protocol  Address          ping
Internet  2.2.2.1            10 
Internet  2.2.2.2            - 

I want to match Word \w+ and then the rest until the next 1234. So the result should be (return groups marked in ()):

(1234_This is a text (Word AB))(

Protocol  Address          ping
Internet  1.1.1.1            - 
Internet  1.1.1.2            25 
Internet  1.1.1.3            8 
Internet  1.1.1.4            - 

)(1234_This is a text (Word BCD)(    
Protocol  Address          ping
Internet  2.2.2.1            10 
Internet  2.2.2.2            - )

The first part is easy:

matches = re.findall(r'1234_This is a text (Word \w+)', var)

But I am unable to achieve the next part. I have tried a negative lookahead, ^(?!1234), but then it no longer matches anything...

mrCarnivore

2 Answers

3

Code

(1234[\w ]+(Word \w+))((?:(?!1234)[\s\S])*)

Using the s modifier (re.DOTALL in Python), . also matches newline characters, so you can use the following instead.

(1234[\w ]+(Word \w+))((?:(?!1234).)*)

Explanation

  • (1234[\w ]+(Word \w+)) Capture the following into capture group 1
    • 1234 Match this literally
    • [\w ]+ Match one or more word characters or spaces
    • (Word \w+) Capture the following into capture group 2
      • Word Match this literally (note the trailing space)
      • \w+ Match any word character one or more times
  • ((?:(?!1234)[\s\S])*) Capture the following into capture group 3
    • (?:(?!1234)[\s\S])* Match the following any number of times (tempered greedy token)
      • (?!1234) Negative lookahead ensuring what follows doesn't match 1234
      • [\s\S] Match any character, including newlines
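
For reference, here is a minimal sketch of how both patterns above might be used from Python. The sample text is copied from the question with trailing spaces removed; the variable names (`text`, `pattern_any`, `pattern_dotall`, `results`) are only illustrative.

```python
import re

# Sample input from the question (trailing spaces removed).
text = """1234_This is a text Word AB

Protocol  Address          ping
Internet  1.1.1.1            -
Internet  1.1.1.2            25
Internet  1.1.1.3            8
Internet  1.1.1.4            -

1234_This is a text Word BCD
Protocol  Address          ping
Internet  2.2.2.1            10
Internet  2.2.2.2            -"""

# Variant 1: [\s\S] matches any character, newlines included.
pattern_any = re.compile(r'(1234[\w ]+(Word \w+))((?:(?!1234)[\s\S])*)')

# Variant 2: plain "." plus re.DOTALL (the "s" modifier).
pattern_dotall = re.compile(r'(1234[\w ]+(Word \w+))((?:(?!1234).)*)', re.DOTALL)

# With three capture groups, findall returns a list of 3-tuples:
# (whole header line, "Word ..." part, everything up to the next "1234").
results = pattern_any.findall(text)
for header, word, body in results:
    print(repr(header), repr(word), repr(body))
```

Both variants should return the same tuples for this input. Note that the body stops at any occurrence of 1234, so data that itself contains 1234 would end a block early.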
ctwheels
  • Thanks that worked. Wow, this negative lookahead stuff is still really hard to get my head around... – mrCarnivore Nov 30 '17 at 15:54
  • @mrCarnivore It's basically saying this: at this position in the string, do the next characters match `1234`? If so stop matching, otherwise continue matching. – ctwheels Nov 30 '17 at 15:55
  • What is the `[\s\S]` for? You can replace it with `.` (matching any char). However, `.*` does not work, although I would expect it to. – mrCarnivore Nov 30 '17 at 15:58
  • @mrCarnivore you can use `.` if you turn on the single line modifier in regex. `.` doesn't match newline characters, which is why `[\s\S]` is used. `[\s\S]` says to match any whitespace or non-whitespace character (in other terms, match any character). – ctwheels Nov 30 '17 at 16:00
  • Ok, I think I get it: you specify what you don't want with `(?!1234)`, however in the same capturing group you do also need to specify what you _do_ want. That is why you have to either add `.` or `\s\S`. The quantifier for the whole part comes after the brackets... – mrCarnivore Nov 30 '17 at 16:03
  • I've added an edit using `.` with the `DOTALL` modifier. If you click on the link I added in the explanation ([tempered greedy token](https://stackoverflow.com/questions/30900794/tempered-greedy-token-what-is-different-about-placing-the-dot-before-the-negat/37343088#37343088)), Wiktor explains well how this method works. You're pretty much correct though, I want to match any character unless it'll be `1234`; then I don't want to match. – ctwheels Nov 30 '17 at 16:04
  • Thanks. The `.` itself I fully understand (I only didn't get at first that `\s\S` is also an extended version of every char). However, the necessity at this point was unclear to me as I thought that specifying what I do not want with the negative lookahead would be enough to return anything. But I now realize that negative lookahead itself does not suffice. – mrCarnivore Nov 30 '17 at 16:07
  • @mrCarnivore lookaheads and lookbehinds don't actually match and consume characters; they're basically assertions. This means that at position *X* (whatever *X* represents), the engine ensures that *Y* does or does not match (where *Y* is some condition). – ctwheels Nov 30 '17 at 16:11
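
To illustrate the point made in these comments, here is a small sketch on a made-up string (`sample` is hypothetical, not from the question). It compares a bare lookahead, the tempered token, and a plain `.*`.

```python
import re

sample = "ab1234cd"

# A lookahead on its own succeeds at position 0 but consumes nothing.
bare = re.match(r'(?!1234)', sample)
print(repr(bare.group(0)))      # ''

# Paired with a consuming ".", the check runs before every character,
# so matching stops right where "1234" begins.
tempered = re.match(r'(?:(?!1234).)*', sample)
print(repr(tempered.group(0)))  # 'ab'

# Without the assertion, ".*" runs straight past "1234".
greedy = re.match(r'.*', sample)
print(repr(greedy.group(0)))    # 'ab1234cd'
```

This is why the assertion has to sit inside the repeated group next to something that actually consumes a character.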
1

As you stated:

I want to match Word \w+ and then the rest until the next 1234.

Do you want something like this?

import re
pattern=r'((1234_This is a text) (Word\s\w+))((\n?.*(?!\n\n))*)'
string="""1234_This is a text Word AB

Protocol  Address          ping
Internet  1.1.1.1            -
Internet  1.1.1.2            25
Internet  1.1.1.3            8
Internet  1.1.1.4            -

1234_This is a text Word BCD
Protocol  Address          ping
Internet  2.2.2.1            10
Internet  2.2.2.2            -"""

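# Iterate over every block; group 1 is the header line, group 3 the "Word ..." part,
# and group 4 the text that follows, up to the next blank line.
match = re.finditer(pattern, string, re.M)
for find in match:
    print("this is group_1 {}".format(find.group(1)))
    print("this is group_3 {}".format(find.group(3)))
    print("this is group_4 {}".format(find.group(4)))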

output:

this is group_1 1234_This is a text Word AB
this is group_3 Word AB
this is group_4 

Protocol  Address          ping
Internet  1.1.1.1            -
Internet  1.1.1.2            25
Internet  1.1.1.3            8
Internet  1.1.1.4            
this is group_1 1234_This is a text Word BCD
this is group_3 Word BCD
this is group_4 
Protocol  Address          ping
Internet  2.2.2.1            10
Internet  2.2.2.2            -
Aaditya Ura
  • No, this is not the result I want. I want the original texts returned and split into different capturing groups (the ones I have marked in my question). Beware: There is also one nested capturing group! – mrCarnivore Nov 30 '17 at 16:00
  • Thank you. That also works. However, the other solution is a little more robust, as it will also work if there is no empty line between the data blocks. – mrCarnivore Dec 01 '17 at 08:25
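
The robustness difference mentioned in the last comment can be checked with a rough sketch like the one below. The input `compact` is a hypothetical variant of the sample with no blank line between the blocks, and the names `tempered` and `blank_line` are just labels for the two answers' patterns.

```python
import re

# Hypothetical variant of the sample input with no blank line between blocks.
compact = (
    "1234_This is a text Word AB\n"
    "Protocol  Address          ping\n"
    "Internet  1.1.1.1            -\n"
    "1234_This is a text Word BCD\n"
    "Protocol  Address          ping\n"
    "Internet  2.2.2.1            10"
)

# Pattern from the first answer: stops at the next "1234".
tempered = re.compile(r'(1234[\w ]+(Word \w+))((?:(?!1234)[\s\S])*)')

# Pattern from the second answer: stops only before a blank line ("\n\n").
blank_line = re.compile(r'((1234_This is a text) (Word\s\w+))((\n?.*(?!\n\n))*)')

tempered_matches = list(tempered.finditer(compact))
blank_line_matches = list(blank_line.finditer(compact))

print(len(tempered_matches))    # number of blocks found by the first pattern
print(len(blank_line_matches))  # number of blocks found by the second pattern
```

On this input the first pattern should still find both blocks, while the second pattern has no blank line to stop at and so should absorb the second header into the body of the first match.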