Regexp gives an extra matching group

Question

I have some content that is plain text mixed with JSON:

blablabla  bla bla 
sdf
sdfsdfsdf {
    "glossary": [{
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    },
    {
        "val":2
    }]
} dd dfsdfsdf
bla blablablabla

I want to extract the JSON from the string, so I use this regexp:

\{(.|\s)+\}

It gives me (checked it on https://regex101.com/):

Full match with my correctly found json
Empty group

I don't understand what causes the empty group to appear

Use a mere `\{[\s\S]+}`, `(.|\s)+` is a very inefficient pattern, and it gives you the extra group since it is a (repeated) *capturing* group. What is the language you are using in the target environment (not at regex101)? — Wiktor Stribiżew, May 13 '19 at 08:50
1) It is not empty, 2) any capturing parentheses will produce a group. Use non-capturing parentheses (`(?:...)`) to avoid creating the group if you don't need one (or indeed a character class, since your problem allows it). — Amadan, May 13 '19 at 08:52
Then `re.compile(r'\{.+}', re.DOTALL)` is the solution. Of course, `(?s)\{.+}` will work, too. And *Whenever you are using a capturing group, it always returns a submatch* is the direct answer to your question. — Wiktor Stribiżew, May 13 '19 at 10:16
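To make the suggestion in these comments concrete, here is a minimal Python sketch. The sample text is a shortened stand-in for the content in the question, and `SAMPLE`, `JSON_SPAN`, and `extract_json` are hypothetical names chosen for illustration. It assumes the text contains a single JSON object and no stray braces outside it, since the greedy pattern simply spans from the first `{` to the last `}`.

```python
import json
import re

# Shortened stand-in for the text in the question (assumption: one JSON object).
SAMPLE = """blablabla  bla bla
sdfsdfsdf {
    "glossary": [{"title": "example glossary"},
    {
        "val": 2
    }]
} dd dfsdfsdf
bla blablablabla"""

# With re.DOTALL, "." also matches newlines, so no (.|\s) alternation is needed.
# There are no parentheses in the pattern, hence no capturing group.
JSON_SPAN = re.compile(r'\{.+\}', re.DOTALL)


def extract_json(text):
    """Return the parsed JSON found between the first '{' and the last '}', or None."""
    match = JSON_SPAN.search(text)
    if match is None:
        return None
    return json.loads(match.group(0))


match = JSON_SPAN.search(SAMPLE)
data = extract_json(SAMPLE)

# The character-class variant from the first comment matches the same span.
same_span = re.search(r'\{[\s\S]+\}', SAMPLE).group(0) == match.group(0)

print(match.groups())               # () : no extra group
print(data["glossary"][1]["val"])   # 2
print(same_span)                    # True
```

If the surrounding text may itself contain braces, a regex alone is unlikely to be reliable, and a real JSON parser would have to decide where the object ends.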

Egan Wolf · Answer 1 · 2019-05-13T09:16:28.910

The group is not actually empty: it contains the last newline character, matched by `\s`. Regex101 even shows a warning that with a regex like `(.)+`, only the last occurrence of `.` is captured in the group. ~~You can use a non-capturing group, `\{(?:.|\s)+\}`, to get rid of the group, or use a non-capturing group and put a second group around the quantified part, `\{((?:.|\s)+)\}`, to have only one group.~~

Actually, don't do this. Please refer to this comment and comments below.
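The "last occurrence" behaviour can be checked with a small Python sketch. The short `text` value is a made-up stand-in for the question's input, and the second pattern swaps in the character class recommended in the comments (not the struck-out alternation) for the case where the body between the braces is wanted as a single group.

```python
import re

# Made-up stand-in: the character right before the final "}" is a newline.
text = 'before {\n  "val": 2\n} after'

# Repeated capturing group: each iteration overwrites the previous capture,
# so group 1 ends up holding only the last character matched before "}".
capturing = re.search(r'\{(.|\s)+\}', text)
print(repr(capturing.group(1)))   # '\n'

# One group around the whole quantified character class captures the body once.
body = re.search(r'\{([\s\S]+)\}', text)
print(repr(body.group(1)))        # '\n  "val": 2\n'
```

A lone newline renders as blank in most tools, which is why the group looks empty on regex101.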

edited May 13 '19 at 09:16

answered May 13 '19 at 08:54

Egan Wolf


Please never suggest `(?:.|\s)+`, it is a very misfortunate, inefficient pattern that causes crashes, slowdowns, pain (due to excessive backtracking this pattern involves). There are always better alternatives, one is in my top comment to the question. – Wiktor Stribiżew May 13 '19 at 08:54
@WiktorStribiżew OK, I will keep that in mind. – Egan Wolf May 13 '19 at 08:56
