Extracting structure definitions from .h files using RegEx to check for code consistency according to [XXX] standard

Question

I have C structures defined in different .h files within my project, and I'm looking for a way to check, for each structure, whether my coding requirements are met.

For instance: I want all of my bit-field structure types to use the same base type, e.g.:

typedef union {
    uint8_t data;
    struct {
        uint8_t LSB:4;
        uint8_t MSB:4;
    } bit;
} MyType_t;

My original thought was to extract all struct/union definitions from the .h header files using regular expressions, and to "play" with the resulting data (i.e. the matches), then come up with other RegEx (or any other way) to assert that the types are coherent and that all of my requirements are met. I'm not sure this is the best way to do it. I could check manually, but the whole purpose is to have an autonomous code checker or something like that. I also find it to be a good exercise in RegEx and parsing (I think).

To do so, I tried to create a RegEx that matches the code above, and came up with the following:

reg_A = r'((typedef )(union|struct)\n)([\t ]*\{\n)((([\t ]*(void|\w+_t) \w+[\t ]*(:\d)?;).*)\n)+([\t ]*((union|struct)\n)([\t ]*\{\n)((([\t ]*(void|\w+_t) \w+[\t ]*(:\d)?;).*)\n)+([\t ]*\} \w+;)\n)?(\} \w+;)\n'
reg_B = r'([\t ]*((((typedef )?(struct|union))|\{)|(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|(\} \w+ ?;))(\n+|$))+'
reg_C = r'([\t ]*typedef (struct|union))\n[\t ]*\{(([\n\t ]*(struct[\n\t ]*\{)([\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*))+[\n\t ]*\} \w+[\t ]*;)|[\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*))+[\n\t ]*\} \w+[\t ]*;'
reg_D = r'([\t ]*typedef (struct|union))\n[\t ]*\{(([\n\t ]*(struct[\n\t ]*\{)([\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|([\t\n ]*\/\/.*))+[\n\t ]*\} \w+[\t ]*;)|[\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|([\t\n ]*\/\/.*))+[\n\t ]*\} \w+[\t ]*;'
reg_E = r'(\s*typedef (struct|union))\n\s*\{((\s*(struct\s*\{)(\s*(((volatile|static|const|FAR|NEAR|INTER{2}UPT) )*(void|\w+_t)\*?\s*\w+(\s*:\d+)?;.*)|(\s*\/\/.*))+\s*\} \w+\s*;)|\s*(((volatile|static|const|FAR|NEAR|INTER{2}UPT) )*(void|\w+_t)\*?\s*\w+(\s*:\d+)?;.*)|(\s*\/\/.*))+\s*\} \w+\s*;'

They all follow the same general idea and may be more or less optimized for the task and/or for large files.

BTW I'm using Python and a function as "simple" as:

import re

rslt = []
# Collect every struct/union definition matched in the header files.
with open('path/to/output/file.txt', 'w') as out:
    for file in self.header_files:
        with open(file) as f:
            whole = f.read()
        print(file)
        for match in re.finditer(reg_X, whole):
            group = match.group()
            rslt.append(group)
            out.write(group)  # all available structure definitions from .h files

Here self.header_files is a list of all the files I look into. It can easily be replaced by a path to a specific file, in which case the for loop can be removed.

reg_X here means that you can use any of the regular expressions defined above.
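Once the definitions are extracted, the consistency check described earlier (all bit-fields of a definition sharing one base type) could be sketched as follows. This is only a minimal sketch: the BITFIELD pattern and both helper functions are hypothetical names, and it assumes that every base type ends in _t, as in the example union.

```python
import re

# Hypothetical pattern for one bit-field member: "<type>_t <name> : <width>;"
BITFIELD = re.compile(r'(\w+_t)\s+\w+\s*:\s*\d+\s*;')

def bitfield_base_types(definition):
    """Return the set of base types used by the bit-field members."""
    return set(BITFIELD.findall(definition))

def has_consistent_bitfields(definition):
    """True when every bit-field in the definition uses the same base type."""
    return len(bitfield_base_types(definition)) <= 1

good = """typedef union {
    uint8_t data;
    struct {
        uint8_t LSB:4;
        uint8_t MSB:4;
    } bit;
} MyType_t;"""

# Same definition with one bit-field switched to another base type.
bad = good.replace("uint8_t MSB", "uint16_t MSB")

print(has_consistent_bitfields(good))  # True
print(has_consistent_bitfields(bad))   # False
```

In the extraction loop, each match.group() could be passed to has_consistent_bitfields, with the file name reported for every definition that fails.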

Now, here is how I constructed the RegEx (reg_D):

(
    [\t ]*typedef (struct|union)                                <= 'OUTER' DEFINITION
)
\n[\t ]*\{                                                      <= SPACING & BRACKETS
(
    (
        [\n\t ]*(struct[\n\t ]*\{)                              <= 'INNER' DEFINITION
        (
            [\n\t ]*                                            <= SPACING
            (
                ((volatile|static|const|FAR|NEAR|INTERRUPT) )*  <= TYPE
                (void|\w+_t)\*?[\t ]*\w+                        <= 'FINAL' TYPE + NAME
                ([\t ]*:\d+)?                                   <= BITFIELD SPECIFICATION (optional)
                ;.*                                             <= EOL + whatever
            )
            |                                                   || OR
            (
                [\t\n ]*\/\/.*                                  <= LINE STARTING WITH A COMMENT
            )
        )+                                                      <= Variable definitions and comment lines can occur multiple times
        [\n\t ]*\} \w+[\t ]*;                                   <= END OF 'INNER' definition
    )
    |                                                           || OR
    [\n\t ]*                                                    <= SPACING
    (
        ((volatile|static|const|FAR|NEAR|INTERRUPT) )*          <= TYPE
        (void|\w+_t)                                            <= FINAL TYPE
        \*?[\t ]*\w+                                            <= VAR NAME
        ([\t ]*:\d+)?                                           <= Bitfield specification
        ;.*
    )
    |                                                           || OR
    (
        [\t\n ]*\/\/.*                                          <= Line starting with a comment
    )
)+
[\n\t ]*\} \w+[\t ]*;                                           <= End of outer definition

A lot of the expression is duplicated. I tried to write a "nicer" RegEx (reg_B):

(
    [\t ]*
    (
        (
            (
                (typedef )?(struct|union)
            )
            |
            \{
        )
        |
        (
            ((volatile|static|const|FAR|NEAR|INTERRUPT) )*
            (void|\w+_t)
            \*?[\t ]*\w+
            ([\t ]*:\d+)?
            ;.*
        )
        |
        (
            \} \w+ ?;
        )
    )
    (
        \n+
        |
        $
    )
)+

It contains the same 'information', but not in the same order and with different 'requirements': the second one also matches any line such as extern FAR varType_t var;, which is just a simple variable declaration.
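The over-matching can be reproduced with a short check. The snippet below is only a sketch that reuses reg_B exactly as written above and applies it to the single declaration mentioned; the variable names line and match are illustrative.

```python
import re

reg_B = r'([\t ]*((((typedef )?(struct|union))|\{)|(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|(\} \w+ ?;))(\n+|$))+'

# A plain variable declaration, not a struct/union definition.
line = "extern FAR varType_t var;"

match = re.search(reg_B, line)
if match:
    # reg_B accepts a member-like line on its own, without any enclosing braces.
    print("unwanted match:", repr(match.group()))
else:
    print("no match")
```

Because each alternative of reg_B may start a match by itself, nothing forces a member line to sit between typedef struct { and } name;.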

Clarification: I went with regex because I don't have much knowledge of, or practice with, parsing. I'm looking for the 'best way' to complete the task. As pointed out by the answer, a code parser like the one a compiler uses is the (only) best solution.

But this question had two goals in mind. The first one has been answered.

The second objective of this post is to learn more about regex in general (and optimization): for example, getting the same matches as reg_D while avoiding the expression duplication, in the style of reg_B. As you can see, in reg_D the "outer" and "inner" definitions are the same (down to one or two things), but it matches what I want (and no more). reg_B matches the same things, but as it is more "flexible", it also matches where it shouldn't.
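One possible way to keep the strict nesting without writing each sub-expression twice is to assemble the pattern from named fragments. The sketch below is an assumption about how that could look, not a drop-in replacement: the fragment names (QUALIFIER, MEMBER, COMMENT, LINE, INNER, OUTER) are hypothetical, the groups are non-capturing, and the opening brace is allowed on the same line as typedef so that the example union matches.

```python
import re

# Each building block is written once and reused below.
QUALIFIER = r'(?:(?:volatile|static|const|FAR|NEAR|INTERRUPT) )*'
MEMBER = rf'{QUALIFIER}(?:void|\w+_t)\*?[\t ]*\w+(?:[\t ]*:\d+)?;.*'
COMMENT = r'//.*'
# One member or comment line, with its leading whitespace.
LINE = rf'[\n\t ]*(?:{MEMBER}|{COMMENT})'
# A nested struct: braces around one or more lines, then its name.
INNER = rf'[\n\t ]*struct[\n\t ]*\{{(?:{LINE})+[\n\t ]*\}} \w+[\t ]*;'
# The typedef: braces around nested structs and/or plain lines.
OUTER = rf'[\t ]*typedef (?:struct|union)[\n\t ]*\{{(?:{INNER}|{LINE})+[\n\t ]*\}} \w+[\t ]*;'

STRUCT_DEF = re.compile(OUTER)

sample = """typedef union {
    uint8_t data;
    struct {
        uint8_t LSB:4;
        uint8_t MSB:4;
    } bit;
} MyType_t;"""

print(STRUCT_DEF.fullmatch(sample) is not None)
print(STRUCT_DEF.search("extern FAR varType_t var;") is None)
```

The compiled pattern is still long, but the source now states each rule once, so a change to the member syntax only has to be made in MEMBER.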

Thanks all for your time :)

NB: If you have any resources I could look at to gain some knowledge/experience, please feel free to share them, along with any thoughts.

[test regular expressions using : https://regex101.com/]

What is your question? How to optimize your code? How to fix your code? What's wrong with your code? Analyzing C code may require considerable effort. Maybe you can use existing tools for static code analysis. There are (more or less expensive) commercial tools available that may allow creating custom rules. I don't know if open-source tools can do what you need. See e.g. https://stackoverflow.com/a/30955450/10622916 — Bodo, Apr 13 '21 at 14:38
My question is not how to fix the code; the code works as intended and the regex does what it should. The question was more: given the task and what I found/constructed, is the way I handled it a good way, or is it a waste of time? As pointed out by the answer, I should implement/use a code parser, as compilers do. I do not want an existing solution as in 'commercial tools'. The other part of this question was about regex optimization. Of the two I 'exploded' at the end of the post, I would love to know if there is a way to get the same matching as the first one using one like the second. — Nicoas Wadel, Apr 13 '21 at 14:45
You should add all clarification or requested information to the question instead of using comments for this. (Writing a note that you updated the question is useful.) — Bodo, Apr 13 '21 at 14:49

Mike Robinson · Accepted Answer · 2021-04-13T14:54:02.847

The only correct approach is to use a parser instead, such as this one for Python, which has many examples ...

Offload the entire problem of "correctly handling the intricacies of the language itself" to this already existing and tested parser, instead of trying to use regular expressions to "create a sort-of parser, yourself," which is a fool's errand.

Parsers work in various ways, but in essence they work by scanning the source-code to internally create an Abstract-Syntax Tree, or AST. Then, they allow you to explore that tree or to iterate over it through the use of so-called "visitors."

The tree itself is not "abstract." The term comes from the fact that its content is an abstraction of the source-code syntax, representing what the source-code says, not the actual text of it.
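The "abstraction" can be seen with any parser. As a stand-in that needs only the standard library, the sketch below uses Python's own ast module on Python source (not the C parser mentioned above, which is not shown here): two differently written statements produce the same tree.

```python
import ast

# Two spellings of the same statement; only spacing and parentheses differ.
first = ast.dump(ast.parse("x = 1+2"))
second = ast.dump(ast.parse("x = (1 +   2)"))

# The dumps describe the tree structure, not the original text.
print(first == second)  # True
```

Whitespace and redundant parentheses leave no trace in the tree, which is why a check written against the tree does not care about layout.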

You can then create a "visitor" which will be invoked whenever the AST-node of interest is encountered during the "walk." This will be "the right place and time" to gather whatever information you need – but you don't have to be concerned about how you got there. That's the parser's job.

For example, you "visit" some node that corresponds to a typedef struct declaration. Child-nodes of this node will represent the abstract essence of the structure that you are now looking at, no matter how it was written in the source-code text, which is long gone and by now completely irrelevant. The organization and content of those child nodes is known and can be relied upon. Parents of this node represent whatever "contains" this typedef. All of this heavy-lifting is done for you automagically by the parser.
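The visitor idea can be sketched with the standard library alone. The example below is an analogy, not the C parser's API: it uses Python's ast.NodeVisitor on Python source, and the class name ClassFieldCollector is hypothetical. A C parser would offer an equivalent hook for struct and typedef nodes.

```python
import ast

class ClassFieldCollector(ast.NodeVisitor):
    """Collect annotated field names per class, whatever the source layout."""

    def __init__(self):
        self.fields = {}

    def visit_ClassDef(self, node):
        # Called automatically whenever the walk reaches a class definition.
        names = [
            child.target.id
            for child in node.body
            if isinstance(child, ast.AnnAssign)
            and isinstance(child.target, ast.Name)
        ]
        self.fields[node.name] = names
        self.generic_visit(node)  # keep walking into nested definitions

source = """
class MyType:
    lsb: int
    msb: int
"""

collector = ClassFieldCollector()
collector.visit(ast.parse(source))
print(collector.fields)  # {'MyType': ['lsb', 'msb']}
```

The visitor never inspects text or whitespace; it only reads the child nodes handed to it, which is the property the answer relies on.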

Footnote: "This, of course, is how every language interpreter or compiler itself actually works on the front end." No doubt, this package was built using the official so-called "grammar" for the C99 language, which would also be used by every compiler for that language. The technology – including the method for constructing such parsers – is proven, fast, and efficient.
