I have C structures defined in different .h files within my project, and I'm looking for a way to check, for each structure, whether my coding requirements are met.
For instance: I want all of my bit-field structure types to use the same base type, e.g.:
typedef union {
    uint8_t data;
    struct {
        uint8_t LSB:4;
        uint8_t MSB:4;
    } bit;
} MyType_t;
My original thought was to extract all struct/union definitions from the .h header files using a regular expression, and to "play" with the resulting data (i.e. the matches), then come up with other regexes (or any other means) to assert that the types are coherent and that all of my requirements are met. I'm not sure this is the best way to do it; I could do a manual check, but the whole purpose is to have an autonomous code checker or something like that. I also find it a good exercise in regex and parsing (I think).
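To make the "assert types are coherent" step concrete, here is a minimal sketch that takes the text of one extracted definition and collects the base types used by its bit-field members. The helper name `bitfield_base_types` and the simplified member pattern are assumptions made for this illustration only; it understands nothing beyond members of the form `<type>_t name:width;`.

```python
import re

# Hypothetical, simplified member pattern: "<type>_t <name> : <width>;"
BITFIELD = re.compile(r'^[\t ]*(\w+_t)[\t ]+\w+[\t ]*:[\t ]*\d+[\t ]*;', re.MULTILINE)

def bitfield_base_types(definition):
    """Return the set of base types used by bit-field members in `definition`."""
    return set(BITFIELD.findall(definition))

sample = """typedef union {
    uint8_t data;
    struct {
        uint8_t LSB:4;
        uint8_t MSB:4;
    } bit;
} MyType_t;
"""

# Same definition with one member deliberately changed to another base type.
mixed = sample.replace("uint8_t MSB", "uint16_t MSB")

print(bitfield_base_types(sample))  # one type only: the rule holds
print(bitfield_base_types(mixed))   # two types: the rule is violated
```

A set with more than one element flags a definition that mixes base types; comparing against the type of the `data` member would be a natural next check.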
To do so, I tried to create a regex that matches this kind of code, and came up with the following:
reg_A = r'((typedef )(union|struct)\n)([\t ]*\{\n)((([\t ]*(void|\w+_t) \w+[\t ]*(:\d)?;).*)\n)+([\t ]*((union|struct)\n)([\t ]*\{\n)((([\t ]*(void|\w+_t) \w+[\t ]*(:\d)?;).*)\n)+([\t ]*\} \w+;)\n)?(\} \w+;)\n'
reg_B = r'([\t ]*((((typedef )?(struct|union))|\{)|(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|(\} \w+ ?;))(\n+|$))+'
reg_C = r'([\t ]*typedef (struct|union))\n[\t ]*\{(([\n\t ]*(struct[\n\t ]*\{)([\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*))+[\n\t ]*\} \w+[\t ]*;)|[\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*))+[\n\t ]*\} \w+[\t ]*;'
reg_D = r'([\t ]*typedef (struct|union))\n[\t ]*\{(([\n\t ]*(struct[\n\t ]*\{)([\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|([\t\n ]*\/\/.*))+[\n\t ]*\} \w+[\t ]*;)|[\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|([\t\n ]*\/\/.*))+[\n\t ]*\} \w+[\t ]*;'
reg_E = r'(\s*typedef (struct|union))\n\s*\{((\s*(struct\s*\{)(\s*(((volatile|static|const|FAR|NEAR|INTER{2}UPT) )*(void|\w+_t)\*?\s*\w+(\s*:\d+)?;.*)|(\s*\/\/.*))+\s*\} \w+\s*;)|\s*(((volatile|static|const|FAR|NEAR|INTER{2}UPT) )*(void|\w+_t)\*?\s*\w+(\s*:\d+)?;.*)|(\s*\/\/.*))+\s*\} \w+\s*;'
They all follow the same general idea and may be more or less optimized for the task and/or for large files.
BTW, I'm using Python and a function as "simple" as:
import re

rslt = []  # collected matches (assumed to start empty)
with open('path/to/output/file.txt', 'w') as out:  # closed automatically
    for file in self.header_files:
        with open(file) as f:
            whole = f.read()
        print(file)
        for match in re.finditer(reg_X, whole):
            group = match.group()
            rslt.append(group)
            out.write(group)  # all available structure definitions from the .h files
Here, self.header_files is a list of all the files I look into. It can easily be replaced by the path to a single file, removing the for loop statement. reg_X means that you can use any of the regular expressions defined above.
Now, here is how I constructed the regex (reg_D):
(
    [\t ]*typedef (struct|union)    <= 'OUTER' DEFINITION
)
\n[\t ]*\{    <= SPACING & BRACKETS
(
    (
        [\n\t ]*(struct[\n\t ]*\{)    <= 'INNER' DEFINITION
        (
            [\n\t ]*    <= SPACING
            (
                ((volatile|static|const|FAR|NEAR|INTERRUPT) )*    <= TYPE
                (void|\w+_t)\*?[\t ]*\w+    <= 'FINAL' TYPE + NAME
                ([\t ]*:\d+)?    <= BITFIELD SPECIFICATION (optional)
                ;.*    <= EOL + whatever
            )
            |    || OR
            (
                [\t\n ]*\/\/.*    <= LINE STARTING WITH A COMMENT
            )
        )+    <= Variable definition + comment line can occur multiple times
        [\n\t ]*\} \w+[\t ]*;    <= END OF 'INNER' DEFINITION
    )
    |    || OR
    [\n\t ]*    <= SPACING
    (
        ((volatile|static|const|FAR|NEAR|INTERRUPT) )*    <= TYPE
        (void|\w+_t)    <= FINAL TYPE
        \*?[\t ]*\w+    <= VAR NAME
        ([\t ]*:\d+)?    <= Bitfield specification
        ;.*
    )
    |    || OR
    (
        [\t\n ]*\/\/.*    <= Line starting with a comment
    )
)+
[\n\t ]*\} \w+[\t ]*;    <= End of outer definition
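This kind of annotated layout can also live directly in the Python source by compiling with `re.VERBOSE`, which ignores unescaped whitespace in the pattern and treats `#` as the start of a comment. The sketch below does this only for the member-line part of the expression; the name `MEMBER` is mine, and literal spaces have to be written as `[ ]` because a bare space would be dropped in verbose mode.

```python
import re

# Member-line fragment of reg_D, laid out with comments (re.VERBOSE).
# Whitespace inside a character class such as [\t ] is still significant.
MEMBER = re.compile(r'''
    ((volatile|static|const|FAR|NEAR|INTERRUPT)[ ])*   # qualifiers, each followed by one space
    (void|\w+_t)\*?[\t ]*\w+                           # final type, optional pointer, name
    ([\t ]*:\d+)?                                      # optional bit-field width
    ;.*                                                # end of statement and trailing text
''', re.VERBOSE)

print(bool(MEMBER.fullmatch("volatile uint8_t LSB:4; // low nibble")))  # True
print(bool(MEMBER.fullmatch("int counter;")))  # False: "int" does not end in _t
```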
A lot of the expression is duplicated, so I tried to write a "nicer" regex (reg_B):
(
    [\t ]*
    (
        (
            (
                (typedef )?(struct|union)
            )
            |
            \{
        )
        |
        (
            ((volatile|static|const|FAR|NEAR|INTERRUPT) )*
            (void|\w+_t)
            \*?[\t ]*\w+
            ([\t ]*:\d+)?
            ;.*
        )
        |
        (
            \} \w+ ?;
        )
    )
    (
        \n+
        |
        $
    )
)+
It contains the same 'information', but not in the same order and with different 'requirements'; as a result, this second one also gives results on any line like:
extern FAR varType_t var;
Which is just a simple variable definition.
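The difference is easy to reproduce with a short check. The snippet below is only a sketch using the two patterns exactly as listed above; the variable names `line`, `hit_B` and `hit_D` are mine.

```python
import re

reg_B = r'([\t ]*((((typedef )?(struct|union))|\{)|(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|(\} \w+ ?;))(\n+|$))+'
reg_D = r'([\t ]*typedef (struct|union))\n[\t ]*\{(([\n\t ]*(struct[\n\t ]*\{)([\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|([\t\n ]*\/\/.*))+[\n\t ]*\} \w+[\t ]*;)|[\n\t ]*(((volatile|static|const|FAR|NEAR|INTERRUPT) )*(void|\w+_t)\*?[\t ]*\w+([\t ]*:\d+)?;.*)|([\t\n ]*\/\/.*))+[\n\t ]*\} \w+[\t ]*;'

line = "extern FAR varType_t var;\n"

hit_B = re.search(reg_B, line)  # matches, starting just after "extern"
hit_D = re.search(reg_D, line)  # no match: reg_D insists on "typedef struct|union"

print(repr(hit_B.group()) if hit_B else None)
print(hit_D)
```

reg_B accepts the line because each alternative (keyword, brace, member, closing line) may appear on its own, whereas reg_D only accepts members inside a typedef body.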
Clarification: I went with regex because I have little knowledge of, or practice with, parsing. I'm looking for the 'best way' to complete the task. As pointed out by the answer, a real code parser, like the one a compiler uses, is the (only) best solution.
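Short of a full C parser, one intermediate step is to stop describing the nesting in the regex and to count braces instead. The sketch below is an illustration under strong assumptions, not a real parser: it uses a regex only to locate `typedef struct` / `typedef union`, then scans forward, tracking `{` / `}` depth until the terminating `;`. It ignores comments, string literals and preprocessor directives, so braces or semicolons inside those would confuse it. The function name `iter_typedef_blocks` is hypothetical.

```python
import re

START = re.compile(r'\btypedef\s+(?:struct|union)\b')

def iter_typedef_blocks(source):
    """Yield each 'typedef struct/union { ... } Name;' block found in source."""
    pos = 0
    while True:
        m = START.search(source, pos)
        if m is None:
            return
        depth = 0
        seen_brace = False
        end = None
        for i in range(m.end(), len(source)):
            ch = source[i]
            if ch == '{':
                depth += 1
                seen_brace = True
            elif ch == '}':
                depth -= 1
            elif ch == ';' and depth == 0:
                end = i + 1  # the ';' that closes the whole typedef
                break
        if end is None:
            return  # unterminated definition: stop scanning
        if seen_brace:  # skip forward declarations without a body
            yield source[m.start():end]
        pos = end

header = """
extern FAR varType_t var;
typedef union {
    uint8_t data;
    struct {
        uint8_t LSB:4;
        uint8_t MSB:4;
    } bit;
} MyType_t;
typedef struct node node_t;
"""

blocks = list(iter_typedef_blocks(header))
print(len(blocks))
print(blocks[0])
```

Because the nesting is handled by the depth counter, any level of inner struct/union is extracted without adding anything to the pattern.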
But this question had two goals in mind. The first one has been answered.
The second objective of this post is to learn more about regex in general (and its optimization), for example how to get the same matching group as reg_B while avoiding any duplicated sub-expression.
As you can see in reg_D, the "outer" and "inner" definitions are the same (down to one or two things), but it matches what I want (and no more).
The last one (reg_B) matches the same things, but because it is more "flexible" it also matches where it shouldn't.
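One possible way to keep the strictness of reg_D without writing the member and comment parts twice: Python's built-in `re` cannot call a sub-pattern by name (it only has backreferences to already matched text), but the pattern can be assembled from string fragments. The sketch below rebuilds the structure of reg_D that way. The fragment names (`WS`, `MEMBER`, `COMMENT`, `ITEM`, `CLOSE`, `INNER`) and `reg_F` are mine, and I used non-capturing groups, so group numbers differ from reg_D.

```python
import re

# Each fragment is written once and reused (names are illustrative).
WS = r'[\n\t ]*'
MEMBER = (r'(?:(?:volatile|static|const|FAR|NEAR|INTERRUPT) )*'
          r'(?:void|\w+_t)\*?[\t ]*\w+(?:[\t ]*:\d+)?;.*')
COMMENT = r'//.*'
ITEM = WS + r'(?:' + MEMBER + r'|' + COMMENT + r')'  # one member or comment line
CLOSE = WS + r'\} \w+[\t ]*;'                         # "} name;"
INNER = WS + r'struct' + WS + r'\{(?:' + ITEM + r')+' + CLOSE

reg_F = (r'[\t ]*typedef (?:struct|union)\n[\t ]*\{'
         r'(?:' + INNER + r'|' + ITEM + r')+' + CLOSE)

# Like reg_D, this expects the opening brace on its own line.
sample = """typedef union
{
    uint8_t data;
    struct
    {
        uint8_t LSB:4; // low nibble
        uint8_t MSB:4;
    } bit;
} MyType_t;
"""

m = re.search(reg_F, sample)
print(m.group() if m else None)
```

As far as I know, the third-party `regex` package goes further and supports subroutine calls such as `(?&name)`, which would allow the reuse to happen inside the pattern itself.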
Thanks all for your time :)
NB: If you know of any resources I could look at to gain some knowledge/experience, please feel free to share them, along with any thoughts.
[test regular expressions using : https://regex101.com/]