Unfortunately, you did not give much information about what you want to parse. If you add it (please edit your question), I will write an example parser for you.
For now, I can only give you a general explanation of parsers.
It all depends on how your data is structured. You need to understand formal languages and grammars, which can describe your ASCII representation of the information. There is also the so-called Chomsky hierarchy, which classifies languages and also indicates how a parser for each class can be implemented.
Your statement regarding
My naive first guest is to simply do a string comparation if(!strcmp(...)) for the keywords and then string splitting by positions for information.
would work if your data were a so-called Chomsky Type-3 regular language. You would then not use strcmp() or other C functions, but std::regex, to match patterns in your ASCII text and return the results together with their attributes.
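To make the std::regex idea concrete, here is a minimal sketch that classifies single lines. The line kinds and the patterns are my assumptions, guessed from the sample in the question (a keyword line starts with ".", an additional-info line starts with "*"); LineKind, ClassifiedLine and classifyLine are hypothetical names.

```cpp
#include <regex>
#include <string>

// Hypothetical line kinds, guessed from the sample in the question
enum class LineKind { Command, Comment, Data };

struct ClassifiedLine {
    LineKind kind;
    std::string text;   // the attribute: keyword name, comment text or raw data
};

// Classify one line with regular expressions instead of strcmp()
ClassifiedLine classifyLine(const std::string& line)
{
    // ".KEYWORD" on a line of its own (assumed keyword syntax)
    static const std::regex command{R"(^\.([A-Za-z_][A-Za-z0-9_]*)\s*$)"};
    // "* some text" (assumed syntax of an additional-info line)
    static const std::regex comment{R"(^\*\s*(.*)$)"};

    std::smatch m;
    if (std::regex_match(line, m, command)) return {LineKind::Command, m[1].str()};
    if (std::regex_match(line, m, comment)) return {LineKind::Comment, m[1].str()};
    return {LineKind::Data, line};   // everything else is treated as data
}
```

Calling classifyLine(".START_CMD") yields a Command with the text "START_CMD". This handles each line on its own, which is all a regular language allows.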
But your example
.START_CMD
info1 info2 info3
* additionnal_info1...
.END
indicates that you have nested data with compound specifiers. That cannot be expressed by Chomsky Type-3 regular languages. Regular expressions, usually implemented as a DFA (Deterministic Finite Automaton), cannot count. They have no memory; they know only their current state. So they cannot match an arbitrary number of "opening" statements to the same number of "closing" statements.
You need a grammar, ideally a context-free grammar (CFG), to describe such a language. A parser for it is implemented as a pushdown automaton: it uses a "parse stack", and this stack holds all the additional information. That is the memory that regular expressions do not have.
And in my opinion, such an approach would be sufficient for your purposes.
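As a small illustration of what the stack buys you, the following sketch checks whether opening and closing keyword lines are balanced. It rests on assumptions I cannot verify from the question: every line starting with "." other than exactly ".END" opens a block, and ".END" closes the most recently opened one. blocksAreBalanced is a hypothetical helper.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns true if every opening keyword line has a matching ".END".
// Assumption: any line starting with '.' other than ".END" opens a block.
bool blocksAreBalanced(const std::string& text)
{
    std::vector<std::string> open;           // the stack: names of open blocks
    std::istringstream input{text};
    std::string line;
    while (std::getline(input, line)) {
        if (line == ".END") {                // closing keyword
            if (open.empty()) return false;  // ".END" without an opener
            open.pop_back();
        } else if (!line.empty() && line[0] == '.') {
            open.push_back(line.substr(1));  // opening keyword: remember it
        }
        // all other lines are data and do not change the nesting depth
    }
    return open.empty();                     // leftovers are unclosed blocks
}
```

The vector is exactly the memory a DFA lacks: its size is the current nesting depth, which is not bounded in advance.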
Now, how to implement it. There are several options:
- Lex/Flex Yacc/Bison: Extremely powerful. Difficult to understand and implement
- Boost Spirit: Similar to above. Also some time needed to understand
- Handcrafted Parser: Much work.
If you start with a handcrafted parser, you will learn the most and understand how parsing works. I will continue with that in my explanation.
The standard solution is a Shift/Reduce Parser.
You will need a grammar with productions (and normally actions).
You will need TokenTypes to find, read and consume the lexemes of the input data. This is usually implemented with regular expression matching.
Then you will need tokens with attributes. The scanner/lexer, or simply a function getToken, reads the input text and "tokenizes" it. It then returns tokens with attributes (an attribute is, for example, the value of an integer) to the parser.
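A getToken function along these lines could look like the sketch below. The token kinds, the two patterns, and the choice to read from a std::string with a position are my assumptions for illustration, not something taken from your data; Kind, Token and this getToken signature are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <variant>

// Hypothetical token kinds for illustration
enum class Kind { Integer, Double, Word, End, Error };

struct Token {
    Kind kind;
    std::variant<std::monostate, int, double, std::string> attribute{};
};

// Read one token from `text` starting at `pos` and advance `pos` past it
Token getToken(const std::string& text, std::size_t& pos)
{
    static const std::regex number{R"([+-]?\d+(\.\d+)?)"};
    static const std::regex word{R"([A-Za-z_][A-Za-z0-9_]*)"};
    // match_continuous anchors the match at the current position
    constexpr auto anchored = std::regex_constants::match_continuous;

    while (pos < text.size() && text[pos] == ' ') ++pos;   // skip blanks
    if (pos >= text.size()) return {Kind::End};

    const auto first = text.cbegin() + static_cast<std::ptrdiff_t>(pos);
    std::smatch m;
    if (std::regex_search(first, text.cend(), m, number, anchored)) {
        pos += static_cast<std::size_t>(m.length(0));
        // A fractional part means Double, otherwise Integer
        // (std::stoi throws for values that do not fit into an int)
        if (m[1].matched) return {Kind::Double, std::stod(m.str(0))};
        return {Kind::Integer, std::stoi(m.str(0))};
    }
    if (std::regex_search(first, text.cend(), m, word, anchored)) {
        pos += static_cast<std::size_t>(m.length(0));
        return {Kind::Word, m.str(0)};
    }
    ++pos;                        // skip the offending character
    return {Kind::Error};
}
```

Called repeatedly on "info1 42 3.5" with pos starting at 0, it returns a Word, an Integer, a Double and then End. The attribute travels with the token in the std::variant.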
The parser pushes the token onto the stack. Then it tries to match the top of the stack with the right side of a production. If there is a match, the stack is reduced: the elements matching the right side of the production are removed and replaced by the non-terminal on the left side of the production. An action is then invoked.
This is repeated until all input is matched or a Syntax error is detected.
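The shift and reduce steps can be watched on a toy grammar. The grammar S -> x | ( S ) is an assumption chosen only because it is tiny; symbols are single characters and the parse stack is a std::string. Reducing greedily whenever a handle matches works for this grammar, but in general a table-driven parser is needed to decide between shifting and reducing.

```cpp
#include <string>

// Toy grammar (an assumption for illustration):  S -> x  |  ( S )
// Returns true if `input` can be reduced to the single non-terminal S.
// The input is expected to contain only '(', ')' and 'x'.
bool accepts(const std::string& input)
{
    std::string stack;                           // parse stack of symbols
    auto endsWith = [&stack](const std::string& handle) {
        return stack.size() >= handle.size() &&
               stack.compare(stack.size() - handle.size(), handle.size(), handle) == 0;
    };
    for (char c : input) {
        stack.push_back(c);                      // shift
        for (bool reduced = true; reduced; ) {   // reduce as long as possible
            reduced = false;
            for (const std::string handle : {"x", "(S)"}) {
                if (endsWith(handle)) {
                    // replace the right side on the stack by the left side
                    stack.erase(stack.size() - handle.size());
                    stack.push_back('S');
                    reduced = true;
                    break;
                }
            }
        }
    }
    return stack == "S";                         // all input matched
}
```

For "((x))" the stack goes through "(", "((", "((x", "((S", "((S)", "(S", "(S)", "S". For "((x)" it ends as "(S", which is a syntax error.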
I will now show you some (NOT COMPILED, NOT TESTED) pseudo code:
#include <algorithm>
#include <functional>
#include <iostream>
#include <string>
#include <variant>
#include <vector>

// Token types for terminals and non-terminals
enum class TokenType {END, OK, EXPRESSION, START1, END1, START2, END2, INTEGER, DOUBLE, STRING};

struct TokenWIthAttribute {
    // Implicit on purpose, so that a Handle can be written as a list of TokenTypes
    TokenWIthAttribute(const TokenType &tt) : tokenType(tt) {}
    // No hand-written copy constructor: the generated one also copies the attribute
    TokenType tokenType{};
    std::variant<int, double, std::string> attribute{};
    // Two tokens match if their types are equal; the attribute is ignored
    bool operator ==(const TokenWIthAttribute& twa) const { return tokenType == twa.tokenType;}
};

using NonTerminal = TokenType;
using Handle = std::vector<TokenWIthAttribute>;
using Action = std::function<TokenWIthAttribute(TokenWIthAttribute&)>;

struct Production {
    NonTerminal nonTerminal{};  // Left side of production
    Handle handle{};            // Right side of production
    Action action;              // Action to take during reduction
};

using Grammar = std::vector<Production>;

TokenWIthAttribute actionEndOK(TokenWIthAttribute& twa) {
    // Do something with twa
    return twa;
}

Grammar grammar{
    { TokenType::OK, {TokenType::START1, TokenType::EXPRESSION, TokenType::END1, TokenType::END}, actionEndOK}
    // Many lines of more productions
};

using ParseStack = std::vector<TokenWIthAttribute>;

class Parser
{
public:
    bool parse(std::istream &is);                   // Driver loop, not shown here
protected:
    TokenWIthAttribute getToken(std::istream &is);  // Scanner, not shown here
    void shift(TokenWIthAttribute& twa) { parseStack.push_back(twa); }
    bool matchAndReduce();
    ParseStack parseStack;
};

bool Parser::matchAndReduce()
{
    // Iterate over all productions in the grammar
    for (const Production& production : grammar) {
        const auto n = production.handle.size();
        // Skip productions that are longer than the stack (or empty)
        if (n == 0 || n > parseStack.size()) continue;
        // Match the top of the stack with the right side of the production
        if (std::equal(production.handle.begin(), production.handle.end(), parseStack.end() - n)) {
            // Call the action on the last matched token while it is still on the stack
            TokenWIthAttribute reduced = production.action ? production.action(parseStack.back())
                                                           : TokenWIthAttribute{ production.nonTerminal };
            // The result represents the left side of the production
            reduced.tokenType = production.nonTerminal;
            // Reduce: remove the handle (resize() would need a default constructor)
            parseStack.erase(parseStack.end() - n, parseStack.end());
            parseStack.push_back(reduced);
            return true;
        }
    }
    return false;
}

int main()
{
    std::cout << "Hello World\n";
    return 0;
}
I hope this gives you a first impression.