
I have content that is text mixed with JSON:

blablabla  bla bla 
sdf
sdfsdfsdf {
    "glossary": [{
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    },
    {
        "val":2
    }]
} dd dfsdfsdf
bla blablablabla

I want to extract the JSON from the string, so I use this regex:

\{(.|\s)+\}

It gives me (checked it on https://regex101.com/):

  • Full match with my correctly found json
  • Empty group

I don't understand why the empty group appears.

amplifier
  • Just use `\{[\s\S]+}`; `(.|\s)+` is a very inefficient pattern, and it gives you the extra group because it is a (repeated) *capturing* group. What language are you using in the target environment (not at regex101)? – Wiktor Stribiżew May 13 '19 at 08:50
  • 1) It is not empty, 2) any capturing parentheses will produce a group. Use non-capturing parentheses (`(?:...)`) to avoid creating the group if you don't need one (or indeed a character class, since your problem allows it). – Amadan May 13 '19 at 08:52
  • @WiktorStribiżew I use python – amplifier May 13 '19 at 10:14
  • Then `re.compile(r'\{.+}', re.DOTALL)` is the solution. Of course, `(?s)\{.+}` will work, too. And *Whenever you are using a capturing group, it always returns a submatch* is the direct answer to your question. – Wiktor Stribiżew May 13 '19 at 10:16

1 Answer


The group is not actually empty: it holds the last newline character, which was matched by \s. Regex101 even warns that with a repeated capturing group such as (.)+, only the last iteration is captured. To get rid of the group, use a non-capturing group, \{(?:.|\s)+\}; to keep exactly one group holding everything between the braces, wrap the quantified non-capturing group in a capturing one, \{((?:.|\s)+)\}.
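The behaviour can be checked in Python on a tiny input. This is only a sketch: `text` is a made-up sample, not the asker's data, and the variable names are illustrative.

```python
import re

# Hypothetical sample: a small JSON object surrounded by plain text.
text = 'intro {\n  "a": [1, 2]\n} outro'

# A repeated capturing group keeps only its last iteration.
m = re.search(r'\{(.|\s)+\}', text)
full_match = m.group(0)       # everything from the first { to the last }
last_iteration = m.group(1)   # the newline just before the closing brace

# A non-capturing group produces no numbered group at all.
m2 = re.search(r'\{(?:.|\s)+\}', text)
group_count = m2.re.groups

# One capturing group around the quantified non-capturing group
# captures everything between the braces.
m3 = re.search(r'\{((?:.|\s)+)\}', text)
inner = m3.group(1)

print(repr(last_iteration), group_count, repr(inner))
```

Printing `repr(last_iteration)` makes the "empty" group visible as a newline character.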

Actually, don't do this. Please refer to this comment and comments below.
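For Python, the comments under the question suggest a plain dot with re.DOTALL instead of the alternation. A minimal sketch of that approach, assuming the text contains exactly one JSON object and no stray braces before or after it; `text` and `JSON_BLOCK` are illustrative names:

```python
import json
import re

# Hypothetical sample: one JSON object embedded in plain text.
text = 'bla bla {\n  "glossary": [{"title": "example"}, {"val": 2}]\n} dd bla'

# With re.DOTALL the dot also matches newlines, so no alternation
# and no capturing group are needed.
JSON_BLOCK = re.compile(r'\{.+\}', re.DOTALL)

match = JSON_BLOCK.search(text)
# The greedy match runs from the first { to the last } in the text.
data = json.loads(match.group(0)) if match else None

print(data)
```

Because the match is greedy, a stray brace in the surrounding text would be swallowed and `json.loads` would raise an error, so this is only suitable when the surrounding text is known to be brace-free.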

Egan Wolf
  • Please never suggest `(?:.|\s)+`; it is a very unfortunate, inefficient pattern that causes crashes, slowdowns, and pain (due to the excessive backtracking it involves). There are always better alternatives; one is in my top comment to the question. – Wiktor Stribiżew May 13 '19 at 08:54
  • @WiktorStribiżew OK, I will keep that in mind. – Egan Wolf May 13 '19 at 08:56