Regex to match all the strings between two identical strings

Question

For example, I have the string `-- This -- is -- one -- another -- comment --` and I want the matched elements to be "This", "is", "one", "another", and "comment".

I tried the regex `--\s+([^--]+)\s+--`, which gives me only "This", "one" and "comment" as the matched elements.

I have searched for similar questions, but they all provide solutions for a case like #A#, where I get A. For #A#B# I also get only A, whereas in this case I want both A and B, as both of them are between two # chars.

I am testing this with JavaScript regex, but I think the solution should be independent of platform/language.

Try this `--\s+([^--]+)\s+`, and then remove last two -- manually: http://www.regexr.com/3fffo — Piyush Kumar Baliyan, Mar 07 '17 at 09:56
This `[^--]+` will prevent matching `mind-breaking` in `-- mind-breaking -- ` — Wiktor Stribiżew, Mar 07 '17 at 09:57

Wiktor Stribiżew · Accepted Answer · 2019-09-23T08:39:17.223

1

In general, you need to use a pattern like

STRING([\s\S]*?)(?=STRING|$)

It will match STRING, then capture into Group 1 any zero or more chars, as few as possible, up to the first occurrence of STRING or the end of string, stopping right before STRING, because (?=...) is a positive lookahead that, being a zero-width assertion, does not consume the matched text.
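As a rough sketch of how this generic pattern might be applied in JavaScript, the delimiter can be escaped and substituted for STRING. The helper names `escapeRegExp` and `betweenDelimiters` are hypothetical and introduced only for illustration:

```javascript
// Hypothetical helper: escape regex metacharacters so the delimiter is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build STRING([\s\S]*?)(?=STRING|$) for a given delimiter
// and collect Group 1 of every match.
function betweenDelimiters(input, delimiter) {
  const d = escapeRegExp(delimiter);
  const rx = new RegExp(d + '([\\s\\S]*?)(?=' + d + '|$)', 'g');
  const found = [];
  let m;
  while ((m = rx.exec(input)) !== null) {
    found.push(m[1]);
  }
  return found;
}

const parts = betweenDelimiters('#A#B#', '#');
console.log(parts); // ["A", "B", ""]
```

Because of the `|$` alternative, the trailing delimiter also matches and yields an empty capture. If a closing delimiter should be required, `|$` can be dropped from the lookahead.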

A generic variation of the pattern is

STRING((?:(?!STRING)[\s\S])*)

It uses a tempered greedy token, (?:(?!STRING)[\s\S])*, that matches any char, 0 or more occurrences, that does not start a STRING char sequence.
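A minimal sketch of the tempered greedy token with `--` as the delimiter, assuming an environment that supports `String.prototype.matchAll` (ES2020). The sample string is made up for illustration:

```javascript
// Sample input (an assumption for this sketch): a hyphenated word between "--" delimiters.
const text = '--mind-breaking--x--';

// "--", then any chars as long as none of them starts a "--" sequence.
const tempered = /--((?:(?!--)[\s\S])*)/g;

// matchAll requires the g flag; take Group 1 from every match.
const captured = Array.from(text.matchAll(tempered), (m) => m[1]);
console.log(captured); // ["mind-breaking", "x", ""]
```

The single `-` inside `mind-breaking` is not a problem, because the negative lookahead only rejects positions where a full `--` starts. The final `--` has nothing after it, so it yields an empty capture.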

To get all the substrings in the current solution, use a lookahead like

/--\s+([\s\S]*?)(?=\s+--)/g
                ^^^^^^^^^

See the regex demo.

Note that [^--]+ matches 1 or more symbols other than a -; it does not match any text that is not equal to --. [...] is a character class that matches a single character. To match text of any length up to the first occurrence of a pattern, you can rely on a [\s\S]*? construct: any 0+ chars, as few as possible (due to the lazy *? quantifier).
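To illustrate the difference, here is a small sketch comparing the character-class version with the lazy `[\s\S]*?` version on an item that contains a single hyphen. The sample string and variable names are assumptions for this example, and `String.prototype.matchAll` (ES2020) is assumed to be available:

```javascript
// Sample input (an assumption for this sketch): one item contains a single "-".
const sample = '-- mind-breaking -- ok --';

// [^--]+ is just "one or more chars other than -", so it cannot cross the hyphen.
const charClass = /--\s+([^--]+)\s+--/g;

// [\s\S]*? lazily takes any chars until the lookahead sees whitespace plus "--".
const lazy = /--\s+([\s\S]*?)(?=\s+--)/g;

const withClass = Array.from(sample.matchAll(charClass), (m) => m[1]);
const withLazy = Array.from(sample.matchAll(lazy), (m) => m[1]);

console.log(withClass); // ["ok"]
console.log(withLazy);  // ["mind-breaking", "ok"]
```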

JS demo:

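// Input with "--" delimiters around each item
var s = '-- This -- is -- one -- another -- comment --';
// The lookahead (?=\s+--) checks for the closing "--" without consuming it,
// so the same "--" can also open the next match
var rx = /--\s+([\s\S]*?)(?=\s+--)/g;
var m, res=[];
// exec() with the g flag resumes from rx.lastIndex on each call
while (m = rx.exec(s)) {
  res.push(m[1]); // Group 1 holds the text between the delimiters
}
console.log(res); // ["This", "is", "one", "another", "comment"]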
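For comparison, a sketch of why the lookahead matters: when the closing `--` is consumed by the match, it is no longer available to open the next item, so every second item is skipped. The names `consuming`, `lookahead` and `collect` are introduced here for illustration only, and `String.prototype.matchAll` (ES2020) is assumed:

```javascript
const input = '-- This -- is -- one -- another -- comment --';

// The closing "--" is part of the match, so the next search starts after it.
const consuming = /--\s+([\s\S]*?)\s+--/g;

// The closing "--" is only asserted, so it can open the next match.
const lookahead = /--\s+([\s\S]*?)(?=\s+--)/g;

// Hypothetical helper: collect Group 1 of every match.
const collect = (rx, str) => Array.from(str.matchAll(rx), (m) => m[1]);

const consumed = collect(consuming, input);
const peeked = collect(lookahead, input);

console.log(consumed); // ["This", "one", "comment"]
console.log(peeked);   // ["This", "is", "one", "another", "comment"]
```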

edited Sep 23 '19 at 08:39

answered Mar 07 '17 at 09:54

Wiktor Stribiżew


Accepted the answer! Thanks for the solution with a positive lookahead. – Sachin G. Mar 07 '17 at 10:04
Lookarounds are the natural way to match substrings that overlap. Also, the `[\s\S]` that matches any char can be replaced with the native JS regex `[^]` construct (*not nothing*), but it is not portable. `[\s\S]` will work almost everywhere. – Wiktor Stribiżew Mar 07 '17 at 10:05
I wonder why you used `/--\s+([\s\S]*?)(?=\s+--)/g` instead of `/--\s+([\s\S]*?)\s+(?=--)/g`. Is there any performance reason, or is it just aesthetics? – Maciej Kozieja Mar 07 '17 at 10:23
@MaciejKozieja: I do not think this is critical here, but that is interesting. Surely, the `\s+` can be moved outside the lookahead. If there is a real performance difference (the number of steps at regex101 does not actually prove any regex is better than another), a test in JS environment should be set up. See https://jsfiddle.net/pbL0cmsj/1/ - almost no difference in performance. – Wiktor Stribiżew Mar 07 '17 at 10:26
Seems like [`/--\s+([\s\S]*?)(?=\s+--)/g` is faster](https://jsperf.com/regex-speed-comparison22) – Maciej Kozieja Mar 07 '17 at 10:32

score 0 · Answer 2 · edited Mar 08 '21 at 14:11


To read all the items, I would use a positive lookahead:

const data = '-- This -- is -- one -- another -- comment --'

const readAll = data => {
  const regex =/--\s*(.*?)\s*(?=--)/g
  const found = []
  let temp
  while (temp = regex.exec(data)) {
    found.push(temp[1])
  }
  return found
}

console.log(readAll(data))

And to remove comments just do this:

const data = `-- This -- is -- one -- another -- comment -- this is not a comment`.replace(/--.*--/g, '')

console.log(data)
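Note that `.*` in the removal regex is greedy, so it removes everything from the first `--` to the last `--` on a line. A short sketch of how that differs from a lazy `.*?`, using a made-up input where text between comments should be kept:

```javascript
// Sample input (an assumption for this sketch): text "b" sits between two comments.
const line = 'a -- x -- b -- y -- c';

// Greedy: one match from the first "--" to the last "--".
const greedyResult = line.replace(/--.*--/g, '');

// Lazy: each "-- ... --" pair is removed separately.
const lazyResult = line.replace(/--.*?--/g, '');

console.log(JSON.stringify(greedyResult)); // "a  c"
console.log(JSON.stringify(lazyResult));   // "a  b  c"
```

Which one is appropriate depends on whether the text between two comments should survive.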

edited Mar 08 '21 at 14:11

Locco0_0


answered Mar 07 '17 at 09:57

Maciej Kozieja


Note that OP's `[^--]` matches line break chars, while your `.*?` will not. – Wiktor Stribiżew Mar 07 '17 at 10:01
If you want line breaks, you can use `(?:\n|.)*?`. `[^--]` does not allow a single `-`, because it means "not the `-` character"; the second dash means "until". – Maciej Kozieja Mar 07 '17 at 10:03
Maciek, never use `(?:\n|.)*?`. It will crash your browser one day. See my answer for a proper way to match any char with a JS regex. – Wiktor Stribiżew Mar 07 '17 at 10:04
So you use `[\s\S]` to match all characters, including newlines etc. That's clever :D – Maciej Kozieja Mar 07 '17 at 10:06
See more about that [here](http://stackoverflow.com/a/36006948/3832970). – Wiktor Stribiżew Mar 07 '17 at 10:07
