-2

I'm trying to make a small scripting language in C#. I'm currently writing a block parser and I'm stuck on writing a regex for a block. Blocks can contain any number of nested sub-blocks.

This is what I need to match:

{ 
    naber(); 
}
{
    int x = 5;
    x = 2;
    if (x == 5) {
        x = 5;
    }
}

I tried this, but it is not working:

\{[^{}]*|(\{[^\{\}]\})*\}

This is my first post, so please have mercy on me.

slowcheet4h
  • Why are you using a regex instead of writing an actual parser with something like [Sprache](https://github.com/sprache/Sprache)? – stuartd Jun 02 '20 at 21:53
  • 1
    Unfortunately programming languages (at least the types you're trying to make) aren't regular so you can't parse them with regular expressions. At one point or another you'll hit a wall and have to use or write a real parser. – Guy Incognito Jun 02 '20 at 21:53
  • You can't have one regex expression for your block. If you were doing this on Linux I would recommend YACC (Yet Another Compiler Compiler), which uses LEX (similar to regex) for expressions. Your expressions are 1) "int x = 5" 2) "x = 2" 3) x == 5 4) x = 5 5) If( ). For a compiler you have to define your language and your expressions. You can't do both in one structure. – jdweng Jun 02 '20 at 21:56
  • @Guy Incognito: The expressions are regular. The OP is missing the language syntax. – jdweng Jun 02 '20 at 21:57
  • @jdweng "The expressions are regular."? I totally don't get your comment. Guy's comment seems to be about "regular grammar" vs. "context-free grammar" (to my understanding, like https://stackoverflow.com/a/1758162/477420). Not really sure what your comment means. – Alexei Levenkov Jun 02 '20 at 22:02
  • The OP wants to build a COMPILER. Expressions are only a small part of a compiler. – jdweng Jun 02 '20 at 22:12
  • 2
    Well, if `{(?>[^{}]+|(?<o>){|(?<-o>)})*}` solves the issue for your "language", then use it. If the `{` and `}` can appear in any other context than function/method body delimiters, you can't rely on the balanced construct and you will have to write a parser. – Wiktor Stribiżew Jun 02 '20 at 23:33

2 Answers

2

Regex will not help you with this. If you are designing a scripting language, possibly to be executed, that has blocks and sub-blocks, you need a context-free grammar, as opposed to a regular grammar, which is what regular expressions describe.

To interpret a context-free language you need the following steps (simplified):

  1. Convert the code string to a list of tokens/symbols. This process is done by a component usually called Lexer.
  2. Convert the tokens into a structured tree (AST - Abstract Syntax Tree) based on grammar rules (things like operator precedence, nested code blocks, etc). This is done by a component usually called Parser.
  3. From here several options arise: you can translate the AST into native code or intermediate code (like bytecode), transpile it into another language, or run it directly in memory, which is the simplest approach and probably what you want/need.

These should already be plenty of concepts to search for, but all of this can be achieved easily with tools like ANTLR. There are alternatives to ANTLR, of course; I just can't recall any right now.
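To make steps 1 and 2 concrete, here is a minimal hand-written sketch in C#. It is not what ANTLR would generate, and every name in it (`Token`, `Lexer`, `Parser`, `Block`, `Statement`) is made up for illustration. It assumes a toy grammar in which a block is `{ ... }`, statements end with `;`, and any text directly before a `{` (such as `if (x == 5)`) is kept as the nested block's header. It does not understand string literals or comments.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { LBrace, RBrace, Semicolon, Text }

public sealed class Token
{
    public TokenKind Kind { get; }
    public string Text { get; }
    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }
}

public abstract class Node { }

public sealed class Statement : Node
{
    public string Text { get; }
    public Statement(string text) { Text = text; }
}

public sealed class Block : Node
{
    public string Header { get; }   // text before '{', e.g. "if (x == 5)"; empty for a bare block
    public List<Node> Children { get; } = new List<Node>();
    public Block(string header) { Header = header; }
}

public static class Lexer
{
    // Step 1: split the source into braces, semicolons, and runs of other text.
    public static List<Token> Tokenize(string source)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < source.Length)
        {
            char c = source[i];
            if (c == '{') { tokens.Add(new Token(TokenKind.LBrace, "{")); i++; }
            else if (c == '}') { tokens.Add(new Token(TokenKind.RBrace, "}")); i++; }
            else if (c == ';') { tokens.Add(new Token(TokenKind.Semicolon, ";")); i++; }
            else
            {
                int start = i;
                while (i < source.Length && "{};".IndexOf(source[i]) < 0) i++;
                string text = source.Substring(start, i - start).Trim();
                if (text.Length > 0) tokens.Add(new Token(TokenKind.Text, text));
            }
        }
        return tokens;
    }
}

public sealed class Parser
{
    private readonly List<Token> _tokens;
    private int _pos;
    public Parser(List<Token> tokens) { _tokens = tokens; }

    // Step 2: build a tree; expects the input to be a sequence of top-level blocks.
    public List<Block> ParseProgram()
    {
        var blocks = new List<Block>();
        while (_pos < _tokens.Count) blocks.Add(ParseBlock(""));
        return blocks;
    }

    // Recursion handles arbitrarily deep nesting: a '{' inside a block parses a child block.
    private Block ParseBlock(string header)
    {
        if (_pos >= _tokens.Count || _tokens[_pos].Kind != TokenKind.LBrace)
            throw new FormatException("Expected '{' at token " + _pos);
        _pos++;
        var block = new Block(header);
        string pending = "";
        while (true)
        {
            if (_pos >= _tokens.Count) throw new FormatException("Missing '}'");
            Token t = _tokens[_pos];
            switch (t.Kind)
            {
                case TokenKind.Text:
                    pending = t.Text; _pos++; break;
                case TokenKind.Semicolon:
                    block.Children.Add(new Statement(pending)); pending = ""; _pos++; break;
                case TokenKind.LBrace:
                    block.Children.Add(ParseBlock(pending)); pending = ""; break;
                case TokenKind.RBrace:
                    if (pending.Length > 0) throw new FormatException("Missing ';' before '}'");
                    _pos++; return block;
            }
        }
    }
}
```

Calling `new Parser(Lexer.Tokenize(source)).ParseProgram()` on the two blocks from the question should give two `Block` nodes, the second containing two statements and a nested block with the header `if (x == 5)`. A real lexer would also split statements into identifiers, numbers and operators; that part is left out here to keep the nesting logic visible.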

rph
-1

I agree with those saying that regex isn't what you should use for parsing code. With that said, it is possible with some regex engines to match balanced characters and extract the code in a block.

This might work for you: `{((?>[^{}]+|(?R))*)}`. If the regex engine supports recursive patterns, it is possible to do some of the work of parsing code with it.

More about it here: Match balanced curly braces
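Note that the .NET regex engine does not support the recursive `(?R)` construct; its counterpart is balancing groups. The following is a minimal sketch of that approach in C#, where the class name `BlockMatcher`, the method name `FindTopLevelBlocks` and the group name `open` are made up for illustration. It assumes that `{` and `}` never appear inside string literals or comments; if they can, a real parser is needed.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BlockMatcher
{
    // (?<open>\{)    pushes a capture for every nested '{'
    // (?<-open>\})   pops one capture for every nested '}' (fails if none is left)
    // (?(open)(?!))  fails the match if any '{' is still unclosed at the end
    private static readonly Regex TopLevelBlock = new Regex(
        @"\{(?>[^{}]+|(?<open>\{)|(?<-open>\}))*(?(open)(?!))\}");

    // Returns the text of each balanced top-level { ... } block, braces included.
    public static List<string> FindTopLevelBlocks(string source)
    {
        var result = new List<string>();
        foreach (Match m in TopLevelBlock.Matches(source))
            result.Add(m.Value);
        return result;
    }
}
```

This only finds where each outer block starts and ends. To get at the sub-blocks you would have to run it again on the inner text of each match, which is one more reason a parser tends to be the simpler tool here.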

Per Ghosh