Regex for matching programming-language-style blocks

Question

I'm trying to make a small scripting language in C#. I'm currently writing the block parser and I'm stuck on the regex for a block. Blocks can contain sub-blocks nested to any depth.

This is what I need to match:

{ 
    naber(); 
}
{
    int x = 5;
    x = 2;
    if (x == 5) {
        x = 5;
    }
}

I tried this, but it doesn't work:

\{[^{}]*|(\{[^\{\}]\})*\}

This is my first post, so please have mercy on me.

Why are you using a regex instead of writing an actual parser with something like [Sprache](https://github.com/sprache/Sprache)? — stuartd, Jun 02 '20 at 21:53
Unfortunately programming languages (at least the types you're trying to make) aren't regular so you can't parse them with regular expressions. At one point or another you'll hit a wall and have to use or write a real parser. — Guy Incognito, Jun 02 '20 at 21:53
You can't have one regex for your block. If you were doing this on Linux I would recommend YACC (Yet Another Compiler Compiler), which uses LEX (similar to regex) for expressions. Your expressions are 1) "int x = 5" 2) "x = 2" 3) "x == 5" 4) "x = 5" 5) "if ( )". For a compiler you have to define your language and your expressions. You can't do both in one structure. — jdweng, Jun 02 '20 at 21:56
@Guy Incognito: The expressions are regular. The OP is missing the language syntax. — jdweng, Jun 02 '20 at 21:57
@jdweng "The expressions are regular."? I totally don't get your comment. Guy's comment seems to be about "regular grammar" vs. "context-free grammar" (to my understanding, like https://stackoverflow.com/a/1758162/477420). I'm not really sure what your comment means. — Alexei Levenkov, Jun 02 '20 at 22:02
The OP wants to build a COMPILER. Expressions are only a small part of a compiler. — jdweng, Jun 02 '20 at 22:12
Well, if `{(?>[^{}]+|(?<o>){|(?<-o>)})*}` solves the issue for your "language", then use it. If the `{` and `}` can appear in any other context than function/method body delimiters, you can't rely on the balanced construct and you will have to write a parser. — Wiktor Stribiżew, Jun 02 '20 at 23:33

score 2 · Accepted Answer · answered Jun 02 '20 at 22:18

Regex will not help you with this. If you are designing a scripting language, possibly one to be executed, that has blocks and sub-blocks, you need a context-free grammar rather than a regular grammar, which is all that regular expressions can express.

To interpret a context-free language you need the following steps (simplified):

1. Convert the code string to a list of tokens/symbols. This is done by a component usually called the lexer.
2. Convert the tokens into a structured tree (AST, Abstract Syntax Tree) based on grammar rules (things like operator precedence, nested code blocks, etc.). This is done by a component usually called the parser.
3. From here several options arise: you can translate the AST into native code or intermediate code (like bytecode), transpile it into another language, or run it directly in memory, which is the simplest approach and probably what you want.
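
As a rough illustration of steps 1 and 2, here is a minimal sketch in Python (chosen for brevity; the same structure carries over to C#). It assumes a toy grammar in which only `{` and `}` are structural and everything between them is kept as opaque statement text. The names `tokenize`, `parse_block`, and `parse_program` are made up for this example, and the sketch does not handle braces inside string literals or comments.

```python
import re

# Step 1 (lexer): split the source into "{", "}", and the text between them.
TOKEN_RE = re.compile(r"[{}]|[^{}]+")


def tokenize(source):
    """Return brace tokens and stripped text chunks, skipping blank chunks."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        text = match.group().strip()
        if text:
            tokens.append(text)
    return tokens


# Step 2 (parser): recursive descent, one call per nesting level.
def parse_block(tokens, pos):
    """Parse one '{ ... }' starting at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens) or tokens[pos] != "{":
        raise SyntaxError("expected '{'")
    pos += 1
    children = []
    while pos < len(tokens):
        token = tokens[pos]
        if token == "}":
            return children, pos + 1
        if token == "{":
            # A nested block becomes a nested list in the tree.
            child, pos = parse_block(tokens, pos)
            children.append(child)
        else:
            children.append(token)
            pos += 1
    raise SyntaxError("missing '}'")


def parse_program(source):
    """Parse a sequence of top-level blocks into a list of nested lists."""
    tokens = tokenize(source)
    blocks, pos = [], 0
    while pos < len(tokens):
        block, pos = parse_block(tokens, pos)
        blocks.append(block)
    return blocks
```

Because `parse_block` calls itself for every inner `{`, nesting depth is unlimited, which is exactly what a single regular expression cannot express. A real lexer would also break the statement text into identifiers, operators, and literals.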

These should already be plenty of concepts to search for, but all of this can be achieved easily with tools like ANTLR. There might be alternatives to ANTLR obviously, I just don’t recall any just now.

score -1 · Answer 2 · answered Jun 03 '20 at 01:30

I agree with those saying that regex isn't what you should use for parsing code. That said, on some regex engines it is possible to match balanced braces and capture the code in a block.

This might work for you: `{((?>[^{}]+|(?R))*)}`. It relies on recursion (`(?R)`), so it only works if the regex engine supports recursive patterns, as PCRE does. .NET's engine does not, and uses balancing groups for the same job.

More about it here: Match balanced curly braces
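
Where the engine has no recursion (Python's built-in `re` module, for example), a plain depth counter does the same job as the recursive pattern. The following is a minimal sketch, not a full parser: the function name `top_level_blocks` is made up for this example, and it assumes `{` and `}` never appear inside string literals or comments.

```python
def top_level_blocks(source):
    """Return the text of each outermost '{ ... }' block in source."""
    blocks = []
    depth = 0      # current nesting level
    start = None   # offset of the '{' that opened the current outer block
    for index, char in enumerate(source):
        if char == "{":
            if depth == 0:
                start = index
            depth += 1
        elif char == "}":
            if depth == 0:
                raise ValueError(f"unmatched '}}' at offset {index}")
            depth -= 1
            if depth == 0:
                # Back at the top level: the outer block is complete.
                blocks.append(source[start:index + 1])
    if depth != 0:
        raise ValueError(f"unclosed '{{' at offset {start}")
    return blocks
```

Calling the function again on the inside of a returned block (without its outer braces) yields the sub-blocks one level down.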
