Match everything except a specific string

Question

I have looked at lots of posts with similar titles, but I have found nothing that works with Python or even on this site: https://regex101.com

How can I match everything but a specific text?

My text:

1234_This is a text Word AB

Protocol  Address          ping
Internet  1.1.1.1            - 
Internet  1.1.1.2            25 
Internet  1.1.1.3            8 
Internet  1.1.1.4            - 

1234_This is a text Word BCD    
Protocol  Address          ping
Internet  2.2.2.1            10 
Internet  2.2.2.2            -

I want to match Word \w+ and then the rest until the next 1234. So the result should be (return groups marked in ()):

(1234_This is a text (Word AB))(

Protocol  Address          ping
Internet  1.1.1.1            - 
Internet  1.1.1.2            25 
Internet  1.1.1.3            8 
Internet  1.1.1.4            - 

)(1234_This is a text (Word BCD))(    
Protocol  Address          ping
Internet  2.2.2.1            10 
Internet  2.2.2.2            - )

The first part is easy: matches = re.findall(r'1234_This is a text (Word \w+)', var). But I am unable to achieve the next part. I have tried a negative lookahead, ^(?!1234), but then it no longer matches anything.

ctwheels · Accepted Answer · 2017-11-30T16:03:05.413

Code

(1234[\w ]+(Word \w+))((?:(?!1234)[\s\S])*)

With the s modifier (re.DOTALL in Python), which makes . match newline characters as well, you can use the following instead.

(1234[\w ]+(Word \w+))((?:(?!1234).)*)
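
For reference, here is a minimal sketch of how both patterns might be used from Python. The sample text is a shortened version of the data in the question, and the variable names (text, pattern, dotall_pattern, matches) are only illustrative.

```python
import re

# Shortened sample based on the text in the question.
text = (
    "1234_This is a text Word AB\n"
    "\n"
    "Protocol  Address          ping\n"
    "Internet  1.1.1.1            -\n"
    "Internet  1.1.1.2            25\n"
    "\n"
    "1234_This is a text Word BCD\n"
    "Protocol  Address          ping\n"
    "Internet  2.2.2.1            10\n"
    "Internet  2.2.2.2            -"
)

# Variant 1: [\s\S] matches any character, newlines included.
pattern = re.compile(r"(1234[\w ]+(Word \w+))((?:(?!1234)[\s\S])*)")

# Variant 2: "." with re.DOTALL (the "s" modifier) behaves the same way.
dotall_pattern = re.compile(r"(1234[\w ]+(Word \w+))((?:(?!1234).)*)", re.DOTALL)

# findall returns one (group 1, group 2, group 3) tuple per block.
matches = pattern.findall(text)
for header, word, body in matches:
    print(repr(header), repr(word), repr(body))
```

Both variants should return the same tuples for this input; which one to use is mostly a matter of taste.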

Explanation

(1234[\w ]+(Word \w+)) Capture the following into capture group 1
- 1234 Match this literally
- [\w ]+ Match one or more word characters or spaces
- (Word \w+) Capture the following into capture group 2
  - Word Match this literally (note the trailing space)
  - \w+ Match any word character one or more times
((?:(?!1234)[\s\S])*) Capture the following into capture group 3
- (?:(?!1234)[\s\S])* Match the following any number of times (tempered greedy token)
  - (?!1234) Negative lookahead ensuring what follows doesn't match 1234
  - [\s\S] Match any single character (including newlines)
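
The same breakdown can be written directly into the pattern with Python's re.VERBOSE flag. This is only a sketch: the names annotated and sample are made up for illustration, and the sample text is a simplified stand-in for the data in the question.

```python
import re

# Same pattern as above, spread out with inline comments.
# In verbose mode, whitespace outside character classes is ignored,
# so the space after "Word" is written as [ ].
annotated = re.compile(
    r"""
    (                   # group 1: the whole header
        1234            #   the literal marker
        [\w ]+          #   word characters or spaces
        (Word[ ]\w+)    #   group 2: "Word " followed by a word
    )
    (                   # group 3: everything up to the next marker
        (?:
            (?!1234)    #   fail if "1234" starts here (consumes nothing)
            [\s\S]      #   otherwise consume one character, newlines included
        )*
    )
    """,
    re.VERBOSE,
)

sample = "1234_This is a text Word AB\nrow 1\nrow 2\n1234_This is a text Word BCD\nrow 3"
for m in annotated.finditer(sample):
    print(m.group(2), repr(m.group(3)))
```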

edited Nov 30 '17 at 16:03

answered Nov 30 '17 at 15:52

ctwheels


Thanks that worked. Wow, this negative lookahead stuff is still really hard to get my head around... – mrCarnivore Nov 30 '17 at 15:54
@mrCarnivore It's basically saying this: at this position in the string, do the next characters match `1234`? If so stop matching, otherwise continue matching. – ctwheels Nov 30 '17 at 15:55
What is the `[\s\S]` for? You can replace it with `.` (matching any char). However, `.*` does not work, although I would expect it to. – mrCarnivore Nov 30 '17 at 15:58
@mrCarnivore you can use `.` if you turn on the single line modifier in regex. `.` doesn't match newline characters, which is why `[\s\S]` is used. `[\s\S]` says to match any whitespace or non-whitespace character (in other terms, match any character). – ctwheels Nov 30 '17 at 16:00
Ok, I think I get it: you specify what you don't want with `(?!1234)`, however in the same capturing group you do also need to specify what you _do_ want. That is why you have to either add `.` or `\s\S`. The quantifier for the whole part comes after the brackets... – mrCarnivore Nov 30 '17 at 16:03
I've added an edit using `.` with the `DOTALL` modifier. If you click on the link I added in the explanation ([tempered greedy token](https://stackoverflow.com/questions/30900794/tempered-greedy-token-what-is-different-about-placing-the-dot-before-the-negat/37343088#37343088)), Wiktor explains well how this method works. You're pretty much correct though, I want to match any character unless it'll be `1234`; then I don't want to match. – ctwheels Nov 30 '17 at 16:04
Thanks. The `.` itself I fully understand (I only didn't get at first that `\s\S` is also an extended version of every char). However, the necessity at this point was unclear to me as I thought that specifying what I do not want with the negative lookahead would be enough to return anything. But I now realize that negative lookahead itself does not suffice. – mrCarnivore Nov 30 '17 at 16:07
@mrCarnivore lookaheads and lookbehinds don't actually consume any characters: they're basically assertions. This means they check that, at position *X* (whatever *X* represents), *Y* does or does not match (where *Y* is some condition) – ctwheels Nov 30 '17 at 16:11
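
The points made in these comments can be checked with a short experiment. This is a sketch with a made-up input string (text); the variable names are illustrative only.

```python
import re

text = "abc\n1234 def"

# A lookahead on its own is just an assertion: it succeeds here,
# but the resulting match is empty because nothing was consumed.
lookahead_only = re.match(r"(?!1234)", text).group(0)

# Without DOTALL, "." refuses to match the newline, so matching stops early.
no_flag = re.match(r"(?:(?!1234).)*", text).group(0)

# With DOTALL, or with [\s\S], the newline is consumed and matching
# stops right before "1234".
with_flag = re.match(r"(?:(?!1234).)*", text, re.DOTALL).group(0)
char_class = re.match(r"(?:(?!1234)[\s\S])*", text).group(0)

# A plain ".*" has no lookahead guarding each character, so with DOTALL
# it runs straight past "1234" to the end of the string.
plain_dotstar = re.match(r".*", text, re.DOTALL).group(0)

for value in (lookahead_only, no_flag, with_flag, char_class, plain_dotstar):
    print(repr(value))
```

This is why the lookahead has to be paired with something that consumes a character, and why the quantifier goes around the pair rather than after the dot alone.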

Aaditya Ura · Answer 2 · 2017-11-30T16:42:49.747

As you stated:

I want to match Word \w+ and then the rest until the next 1234.

Do you want something like this?

import re
pattern=r'((1234_This is a text) (Word\s\w+))((\n?.*(?!\n\n))*)'
string="""1234_This is a text Word AB

Protocol  Address          ping
Internet  1.1.1.1            -
Internet  1.1.1.2            25
Internet  1.1.1.3            8
Internet  1.1.1.4            -

1234_This is a text Word BCD
Protocol  Address          ping
Internet  2.2.2.1            10
Internet  2.2.2.2            -"""

match=re.finditer(pattern,string,re.M)
for find in match:
    print("this is group_1 {}".format(find.group(1)))
    print("this is group_3 {}".format(find.group(3)))
    print("this is group_4 {}".format(find.group(4)))

output:

this is group_1 1234_This is a text Word AB
this is group_3 Word AB
this is group_4 

Protocol  Address          ping
Internet  1.1.1.1            -
Internet  1.1.1.2            25
Internet  1.1.1.3            8
Internet  1.1.1.4            
this is group_1 1234_This is a text Word BCD
this is group_3 Word BCD
this is group_4 
Protocol  Address          ping
Internet  2.2.2.1            10
Internet  2.2.2.2            -

No, this is not the result I want. I want the original texts returned and split into different capturing groups (the ones I have marked in my question). Beware: There is also one nested capturing group! — mrCarnivore, Nov 30 '17 at 16:00
Thank you. That also works. However, the other solution is a little more robust as it will also work if there is not an empty line in between the data blocks. — mrCarnivore, Dec 01 '17 at 08:25
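
The robustness point in the last comment can be illustrated with a small comparison. This is a sketch on a made-up input (compact) that has no empty line between the two blocks; the pattern strings are copied from the two answers, and the variable names are illustrative.

```python
import re

# Two blocks with no empty line separating them.
compact = (
    "1234_This is a text Word AB\n"
    "Internet  1.1.1.1            -\n"
    "1234_This is a text Word BCD\n"
    "Internet  2.2.2.1            10"
)

# Pattern from the accepted answer: stops at the next "1234".
tempered = r"(1234[\w ]+(Word \w+))((?:(?!1234)[\s\S])*)"

# Pattern from this answer: relies on an empty line to end a block.
blank_line_based = r"((1234_This is a text) (Word\s\w+))((\n?.*(?!\n\n))*)"

tempered_hits = [m.group(2) for m in re.finditer(tempered, compact)]
blank_line_hits = [m.group(3) for m in re.finditer(blank_line_based, compact)]

print(tempered_hits)    # one entry per block
print(blank_line_hits)  # the second block is absorbed into the first match
```

On this input the blank-line-based pattern has nothing to stop it at the second header, so its trailing group runs on to the end of the text.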
