
I'm trying to capture everything between two markers, START[number] and END[number], and I also need to extract the number. The captured text may contain line breaks.

For instance, the following:

START[1] 
message to capture ...END[1] 

must return:

  1. group 1: 1
  2. group 2: message to capture ...
  3. group 3: 1

Here is my attempt (demo):

START\[(\d+)\]((.|\n|\r)*?)END\[(\d+)\]

It doesn't work: I get a third group that contains only the last character of the message to capture, and I don't know why.

Can someone help me with this? Thanks.

Malick
  • You simply have too many capturing groups. Use a non-capturing group inside: `START\[(\d+)\]((?:.|\n|\r)*?)END\[(\d+)\]` – ASDFGerte Nov 29 '19 at 13:36
  • NEVER use `(.|\n|\r)*?`. Use dedicated patterns to match any char, like `[^]` in JS. Or a workaround like `[\s\S]` / `[\d\D]` / `[\w\W]` – Wiktor Stribiżew Nov 29 '19 at 13:39
  • It's true that `(.|\n|\r)*?` is not a good way to handle things; `[\s\S]` or the dotall flag `s` (not widely supported yet) are better solutions. However, I feel the problem stems from using parentheses where a capturing group is not intended, instead of a non-capturing group. – ASDFGerte Nov 29 '19 at 13:43
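
The extra group described in the comments can be reproduced with a short script. This is only a sketch, assuming a JavaScript engine such as Node.js and the sample input from the question; the variable names are made up for illustration. A capturing group that is repeated keeps only its last iteration, which is why the third group ends up holding the final character of the message.

```javascript
// Sample input from the question: the message spans a line break.
const sampleInput = "START[1] \nmessage to capture ...END[1] ";

// Original attempt: the inner (.|\n|\r) is itself a capturing group (group 3).
const originalPattern = /START\[(\d+)\]((.|\n|\r)*?)END\[(\d+)\]/;
// Same pattern with the inner group made non-capturing via (?:...).
const nonCapturingPattern = /START\[(\d+)\]((?:.|\n|\r)*?)END\[(\d+)\]/;

const originalMatch = sampleInput.match(originalPattern);
const fixedMatch = sampleInput.match(nonCapturingPattern);

// Index 0 is the whole match, so the group count is length - 1.
console.log(originalMatch.length - 1);          // 4 groups
console.log(JSON.stringify(originalMatch[3]));  // "." (last iteration of the repeated group)
console.log(fixedMatch.length - 1);             // 3 groups
console.log(JSON.stringify(fixedMatch[2]));     // the full message, line break included
```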

1 Answer


Use [\s\S] instead of (.|\n|\r). It matches any character, including line breaks, without adding an extra capturing group:

START\[(\d+)\]([\s\S]+?)END\[(\d+)\]
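
As a rough illustration of how this pattern might be used in JavaScript, here is a minimal sketch. The helper name `extractBlocks` and the second sample block are assumptions made for the example, not part of the question.

```javascript
// Hypothetical helper: returns every START[n] ... END[m] block found in text.
function extractBlocks(text) {
  // matchAll requires the g flag; [\s\S] matches any character, newlines included.
  const pattern = /START\[(\d+)\]([\s\S]+?)END\[(\d+)\]/g;
  return Array.from(text.matchAll(pattern), (m) => ({
    start: m[1],   // number after START
    message: m[2], // everything between the markers
    end: m[3],     // number after END
  }));
}

const blocks = extractBlocks(
  "START[1] \nmessage to capture ...END[1] START[2]second\r\nblockEND[2]"
);
console.log(blocks.length);      // 2
console.log(blocks[0].start);    // 1
console.log(JSON.stringify(blocks[1].message)); // "second\r\nblock"
```

Note that `+?` requires at least one character between the markers, so an empty message such as `START[1]END[1]` would not match; `*?` would allow it.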


To make sure START and END carry the same number, use a backreference to group 1 (credit to Aaron de Windt in the comments):

START\[(\d+)\]([\s\S]+?)END\[(\1)\]
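
The effect of the backreference can be checked with a small comparison. This is a sketch under the assumption of a JavaScript engine; the two sample strings are invented for the example.

```javascript
// Without the backreference, any number is accepted after END.
const loosePattern = /START\[(\d+)\]([\s\S]+?)END\[(\d+)\]/;
// With \1, the number after END must repeat what group 1 captured.
const strictPattern = /START\[(\d+)\]([\s\S]+?)END\[(\1)\]/;

const mismatchedInput = "START[1]\nmessageEND[2]";
const matchedInput = "START[7]\nmessageEND[7]";

console.log(loosePattern.test(mismatchedInput));  // true
console.log(strictPattern.test(mismatchedInput)); // false

const strictMatch = matchedInput.match(strictPattern);
console.log(strictMatch[1], strictMatch[3]);      // 7 7
```

Keeping the parentheses around `\1` preserves the three-group layout asked for in the question; dropping them gives the same matches with only two groups.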
Toto
  • You can also swap the second `(\d+)` with `\1` so it only matches if the two numbers are the same: `START\[(\d+)\]([\s\S]+?)END\[\1\]` – Aaron de Windt Nov 29 '19 at 13:50