Arbitrary number of capture groups in multiline strings

Question

I have a long, Markdown-formatted string which consists of repeated sections of one or more headers and a multi-line description, like so:

**[Title1](link1) brief description** flag1, flag2
commentary,
occasionally multi-line
---

**[Title2](link2) brief description** flag3, flag4
**[Title3](link3) brief description** flag5, flag6, flag7
commentary
---

...

This order is occasionally broken with other text, interwoven between --- and the next header.

I wish to process it with JS's regex in order to capture the title, link, description and commentary in separate capture groups. Ideally, from the example given I would like to get something like:

1st match:
    group 1: Title1
    group 2: link1
    group 3: brief description
    group 4: commentary,
             occasionally multi-line

2nd match:
    group 1: Title2
    group 2: link2
    group 3: brief description 2
    group 4: Title3
    group 5: link3
    group 6: brief description 3
    group 7: commentary

 ...

I'm not going to lie - my regex skills could use some polishing, however I managed to solve this problem, restricting it to singular headers (using a regex akin to /\*\*\[(.*)\]\((.*)\)\s+(.*)\*\*.*\s+((?:.*\s)*?)?---/g). With an unspecified number of them, I'm not sure how to gather the separate fragments into concise groups, because no matter what I try, I either get separate matches for headers belonging to one item, or the second and subsequent headers get mashed with the commentary.

Is this possible with regex only? I would like to avoid splitting by item boundaries (**[ and --- in this case) and chopping it further from there, because that seems less elegant than a single regex match.

I'm not sure how regex works in JS, but in PHP the dot (.) will not match newlines. There is a modifier (s), used like "/match/s", that tells . to match everything. Since you are using dots, I assume these will not match newlines. — Kohjah Breese, Aug 31 '14 at 10:31
@pushpraj Expected captures are provided in the original post in the second `code` block. Perhaps I should have used _capture_ instead of _group_. Please let me know if that is not what you meant. — Konrad, Aug 31 '14 at 12:08
@KohjahBreese Yes, but `\s` does. The regex given is for when the problem is restricted to one header per item. I'm not sure how to go about matching and capturing when multiple are present. — Konrad, Aug 31 '14 at 12:12
I am not able to test this ATM, but from this part: \*\*.*\s+ ----- the .* will match anything, including a space. If you add a ? to make it \*\*.*?\s+, it will match as little as possible (or nothing) up to \s. I would recommend using something like [^\s]* if possible. Dots can lead to problems. — Kohjah Breese, Aug 31 '14 at 16:54

score 1 · Accepted Answer · answered Aug 31 '14 at 16:59

You're trying to repeat a capturing group and then access all of the captures. Unfortunately, that won't work in the JavaScript regex engine (this is true for most of the others too). The .NET engine actually does support it.
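As a small illustration of that limitation (a sketch, with the variable names and the shortened sample string made up for this example): when a capturing group sits inside a repeated group, each iteration overwrites the previous capture, so only the last header is left in the result.

```javascript
// Two header lines in a row, as in the second item of the question.
const twoHeaders =
  "**[Title2](link2) d2** flag3\n" +
  "**[Title3](link3) d3** flag5\n";

// The whole header line is wrapped in a repeated non-capturing group.
// Groups 1-3 are overwritten on every iteration of the outer +.
const repeated = /(?:\*\*\[(.*?)\]\((.*?)\)\s+(.*?)\*\*.*\n)+/;

const m = repeated.exec(twoHeaders);

// The match spans both lines, but only the last header's captures survive.
console.log(m[1], m[2], m[3]); // Title3 link3 d3
```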

I know you didn't want to split first, but that's probably the best option here. If you can somehow use the .NET regex engine from JS or change your project to use .NET/PowerShell, then you can probably do it in pure regex.
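A minimal sketch of the split-first approach in plain JavaScript, assuming the item layout shown in the question and an engine that supports `String.prototype.matchAll` (ES2020). The function name `parseItems` and the shape of the returned objects are hypothetical, chosen only for this example: an outer regex isolates each item, and an inner regex then pulls every header out of that item.

```javascript
// Hypothetical helper: returns one object per item, each with a
// variable-length list of headers plus the commentary text.
function parseItems(text) {
  // One item = one or more header lines, then commentary up to a "---" line.
  const itemRe = /((?:^\*\*\[.*\]\(.*\).*\*\*.*\n)+)([\s\S]*?)^---$/gm;
  // One header = **[title](link) description** flags
  const headerRe = /^\*\*\[(.*?)\]\((.*?)\)\s+(.*?)\*\*/gm;

  const items = [];
  for (const item of text.matchAll(itemRe)) {
    const headers = [];
    for (const h of item[1].matchAll(headerRe)) {
      headers.push({ title: h[1], link: h[2], description: h[3] });
    }
    items.push({ headers, commentary: item[2].trim() });
  }
  return items;
}

// Sample input copied from the question.
const sample = [
  "**[Title1](link1) brief description** flag1, flag2",
  "commentary,",
  "occasionally multi-line",
  "---",
  "",
  "**[Title2](link2) brief description** flag3, flag4",
  "**[Title3](link3) brief description** flag5, flag6, flag7",
  "commentary",
  "---",
  "",
].join("\n");

const items = parseItems(sample);
console.log(items.length);            // 2
console.log(items[1].headers.length); // 2
```

Text that appears between a `---` line and the next header is skipped, because the outer regex only starts a match at a line that looks like a header.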

Reference

Repeating a Capturing Group vs. Capturing a Repeated Group

I see. Bummer then, it's client-side in a web-app. Splitting and processing it is, then. Thanks! — Konrad, Aug 31 '14 at 18:46

Mathieu David · Answer 2 · 2014-08-31T18:16:18.873

1

I think I got it with one regex:

var re = /(?:\*\*\[(.*)\]\((.+)\) (.+)\*\* .*\n)(?:([^\*(?:\-\-\)]+))?/g;

I'm not sure it is what you asked for, but it matches your input and output. You can play with it in this Regex101 example.

And here you can find a JSFiddle that uses that regex and displays the captured groups.

Of course, it is not very strict, so you may have to adapt it to your needs.

I hope this is what you wanted.

edited Aug 31 '14 at 18:16

answered Aug 31 '14 at 17:49

Mathieu David


The thing is, for the example provided I wanted it to have only two matches with a variable number of captures, because that would simplify further processing: each match would correspond to one item, and that way I'd have everything I wanted to achieve in a single while loop utilizing `exec()`. I'll have to weigh my options now and decide whether testing the match for capture count is preferable, or whether I should just split the thing and hack from there. Thanks! – Konrad Aug 31 '14 at 18:21
Well, you could reorganize the matches in a custom array or object in the while loop if it is just the way it is organized that bothers you. But I am not totally sure I understand what you are trying to avoid ;) – Mathieu David Aug 31 '14 at 18:48
Just wanted to be as concise and elegant as possible. Processing the matches is always a possibility, but it is a possibility I wanted to avoid if getting all the info from a single regex was possible. I know now it isn't. – Konrad Aug 31 '14 at 19:19
Well, the info is there, but not formatted like you want ;) Good luck with your continuation! – Mathieu David Aug 31 '14 at 19:39
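The regrouping suggested in the comments above could look roughly like the following sketch. It does not use the regex from the answer. It uses a simplified per-header pattern written for this example, and `groupHeaders` and the object shape are hypothetical names. Each `exec()` call yields one header. Headers are accumulated until a match is followed by `---`, which closes the item.

```javascript
// Hypothetical helper: one exec() match per header, regrouped into items.
function groupHeaders(text) {
  // Groups: 1 title, 2 link, 3 description,
  // 4 following lines that start neither a header nor "---",
  // 5 the closing "---" when this header is the last one of its item.
  const headerRe =
    /^\*\*\[(.*?)\]\((.*?)\)\s+(.*?)\*\*.*\n((?:(?!\*\*\[|---).*\n)*)(---)?/gm;

  const items = [];
  let current = { headers: [], commentary: "" };
  let m;
  while ((m = headerRe.exec(text)) !== null) {
    current.headers.push({ title: m[1], link: m[2], description: m[3] });
    current.commentary += m[4];
    if (m[5] !== undefined) {
      // The item is complete: store it and start a fresh one.
      current.commentary = current.commentary.trim();
      items.push(current);
      current = { headers: [], commentary: "" };
    }
  }
  return items;
}

const grouped = groupHeaders(
  "**[A](a) x** f\n" +
  "**[B](b) y** g\n" +
  "note\n" +
  "---\n" +
  "\n" +
  "stray text\n" +
  "**[C](c) z** h\n" +
  "---\n"
);
console.log(grouped.length); // 2
```

This keeps everything in a single `while` loop over `exec()`, at the cost of a little bookkeeping to decide where one item ends.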
