Writing a parser for formatted text in C++

Question

I'm trying to write a parser for a formatted ASCII file with blocks like this

.START_CMD
info1 info2 info3
* additionnal_info1...
.END

Each field can be a string, an integer, a double, etc., written as formatted text (E15.7, 64s, etc.). There may also be some information that I don't want to store.

My naive first guess is to simply do a string comparison, if(!strcmp(...)), for the keywords and then split the string by position to get the information.

Do you know a more efficient way to do the same task?

Does the file have to be in ASCII or would you consider making a binary file? — Naor Hadar, Jun 24 '19 at 08:36
Check [boost spirit](https://www.boost.org/doc/libs/1_70_0/libs/spirit/doc/html/spirit/qi.html) — Victor Gubin, Jun 24 '19 at 08:39
If you want your parser to be as reliable as possible, you can use Lex to tokenize the input and YACC to generate the parser. — ForceBru, Jun 24 '19 at 08:41
@VictorGubin, does Boost know how to treat text like 1234534567 as 2 integers in 5d format? I think that the tokenizer in Boost needs a separator, right? — Romili Paredes, Jun 24 '19 at 09:13
@ForceBru I'll look at Lex and YACC, but I read that they are quite hard to use for anything other than expressions :( — Romili Paredes, Jun 24 '19 at 09:14
If you only have a limited number of cases it might be worth checking them with [regular expressions](https://en.cppreference.com/w/cpp/regex). However, if you have lots of cases, this might be tedious. — jan.sende, Jun 24 '19 at 10:10
For your inspiration: A simple parser based on a Syntax Diagram [SO: How to rearrange a string equation?](https://stackoverflow.com/a/50021308/7478597). Another parser based on a simple grammar [SO: Tiny Calculator](https://stackoverflow.com/a/46965151/7478597) — Scheff's Cat, Jun 24 '19 at 10:21
@Romili Paredes Spirit is an LL parser framework (like Antlr, for example); the tokenizer is another Boost library. It knows your [EBNF grammar](https://www.boost.org/doc/libs/1_70_0/libs/spirit/doc/html/spirit/qi/reference/numeric/int.html). `template < typename T , unsigned Radix , unsigned MinDigits , int MaxDigits> struct int_parser;` — Victor Gubin, Jun 24 '19 at 10:21
When you use text, prefer `std::string` rather than character arrays. Character arrays can overflow. The `std::string` will dynamically grow as necessary. — Thomas Matthews, Jun 24 '19 at 14:50
First, you need to implement ``Monad`` and ``Applicative`` and ``Monoid`` and study 5 semesters about homotopy type theory, then you need ``Arrows`` and read a few more papers. After that it will only take you 5 years to write your own monadic parser. Oh wait - he asked about C++, not Haskell, right? Well - just write your damn code ;) — BitTickler, Jun 25 '19 at 08:47

Armin Montigny · Answer 1 · 2019-06-25T09:31:44.013

Unfortunately, you did not give much information about what you want to parse. If you add more detail (please edit your question), I will write an example parser for you.

For now, I can give you only general explanations on Parsers.

It all depends on how your data is structured. You need to understand formal languages and grammars, which can express your ASCII representation of information. There is also the so-called Chomsky hierarchy, which classifies languages and also describes how to implement a parser for each class.

Your statement regarding

My naive first guess is to simply do a string comparison, if(!strcmp(...)), for the keywords and then split the string by position to get the information.

would work if your data is a so-called Chomsky Type-3 regular language. You would not use strcmp() or other C functions, but std::regex, to match patterns in your ASCII text and then return results annotated with attributes.
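As a minimal sketch of that regex-based approach, the following classifies the lines of a block and splits a line into fixed-width integer fields (the 5d case from the comments). The names `LineKind`, `classifyLine` and `splitFixedWidthInts` are hypothetical, and the patterns are only a guess at the real format based on the example in the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical line categories for the block format shown in the question.
enum class LineKind { Start, End, Continuation, Data };

// Classify one line by matching it against regular expressions.
LineKind classifyLine(const std::string& line) {
    static const std::regex startRe(R"(^\.START_\w+\s*$)");
    static const std::regex endRe(R"(^\.END\s*$)");
    static const std::regex contRe(R"(^\*.*$)");
    if (std::regex_match(line, startRe)) return LineKind::Start;
    if (std::regex_match(line, endRe)) return LineKind::End;
    if (std::regex_match(line, contRe)) return LineKind::Continuation;
    return LineKind::Data;
}

// Split a line into integer fields of `width` characters each, e.g.
// "1234534567" with width 5. No separator is needed between fields.
// A trailing partial field and non-numeric fields are skipped.
std::vector<int> splitFixedWidthInts(const std::string& line, std::size_t width) {
    std::vector<int> values;
    if (width == 0) return values;
    // A field is optional padding spaces followed by an optionally signed integer.
    static const std::regex fieldRe("[ ]*[+-]?[0-9]+");
    for (std::size_t pos = 0; pos + width <= line.size(); pos += width) {
        const std::string field = line.substr(pos, width);
        if (std::regex_match(field, fieldRe)) values.push_back(std::stoi(field));
    }
    return values;
}
```

With these helpers, `classifyLine(".START_CMD")` yields `LineKind::Start`, and `splitFixedWidthInts("1234534567", 5)` yields the two integers 12345 and 34567.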

But your example

.START_CMD
info1 info2 info3
* additionnal_info1...
.END

indicates that you have some nested data with compound specifiers. That cannot be expressed by Chomsky Type-3 regular languages. Regular expressions, usually implemented as a DFA (Deterministic Finite Automaton), cannot count. They have no memory. They know only their current state. So they cannot match the number of "opening" statements to the number of "closing" statements.

You need a grammar, ideally a context-free grammar (CFG), to describe such a language. A parser for it would be implemented using a pushdown automaton. You would use a "parse stack", and this stack would hold all additional information. That is the memory that regular expressions do not have.
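To illustrate what the stack adds, here is a minimal sketch that checks whether every opening line has a matching closing line. It assumes (this is not confirmed by the question) that blocks open with ".START_<name>", close with ".END", and may be nested. The function name `blocksAreBalanced` is hypothetical.

```cpp
#include <string>
#include <vector>

// Returns true if every ".START_<name>" line is closed by a later ".END"
// and no ".END" appears without an open block.
// Assumption: blocks may be nested; the real format may not allow that.
bool blocksAreBalanced(const std::vector<std::string>& lines) {
    std::vector<std::string> openBlocks;  // plays the role of the parse stack
    for (const std::string& line : lines) {
        if (line.rfind(".START_", 0) == 0) {
            // Remember which block was opened (the text after ".START_")
            openBlocks.push_back(line.substr(7));
        } else if (line == ".END") {
            if (openBlocks.empty()) return false;  // closing without opening
            openBlocks.pop_back();
        }
    }
    return openBlocks.empty();  // anything left on the stack was never closed
}
```

The vector grows with the nesting depth, which is exactly the unbounded memory a finite automaton lacks.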

And in my opinion, such an approach would be sufficient for your purposes.

Now, how to implement it. There are several options:

Lex/Flex Yacc/Bison: Extremely powerful. Difficult to understand and implement.
Boost Spirit: Similar to the above. It also takes some time to understand.
Handcrafted parser: Much work.

If you start with a handcrafted parser, you will learn the most and understand how parsing works. I will continue with that in my explanations.

The standard solution is a Shift/Reduce Parser.

And you will need a grammar with productions (and normally actions).

You will need TokenTypes to find, read and consume the lexemes of the input data. This is usually implemented with regular expression matching.

Then you will need tokens with attributes. The scanner/lexer, or simply a function getToken, will read the input text and "tokenize" it. It then returns tokens with attributes (an attribute is, for example, the value of an integer) to the parser.
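A minimal sketch of such a getToken function could look like the following. It assumes whitespace-separated lexemes (fixed-width fields without separators would need a different scanner), and the names `Kind`, `Token` and `tokenize` are hypothetical, chosen only for this illustration.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical token kinds; the real set depends on the file format.
enum class Kind { End, Integer, Double, String };

struct Token {
    Kind kind{Kind::End};
    std::variant<int, double, std::string> attribute{};  // value of the lexeme
};

// Read the next whitespace-separated lexeme and classify it.
Token getToken(std::istream& is) {
    std::string lexeme;
    if (!(is >> lexeme)) return Token{};  // input exhausted: Kind::End
    std::size_t used = 0;
    try {  // an integer must consume the whole lexeme
        const int value = std::stoi(lexeme, &used);
        if (used == lexeme.size()) return Token{Kind::Integer, value};
    } catch (const std::logic_error&) {}  // not an int, or out of range
    try {  // otherwise try a floating-point value such as 0.15E+04
        const double value = std::stod(lexeme, &used);
        if (used == lexeme.size()) return Token{Kind::Double, value};
    } catch (const std::logic_error&) {}
    return Token{Kind::String, lexeme};  // everything else is a string
}

// Convenience wrapper: tokenize a whole string.
std::vector<Token> tokenize(const std::string& text) {
    std::istringstream is(text);
    std::vector<Token> tokens;
    for (Token t = getToken(is); t.kind != Kind::End; t = getToken(is)) {
        tokens.push_back(t);
    }
    return tokens;
}
```

For the input "42 1.5E+03 info1" this returns an Integer token, a Double token and a String token, each carrying its value in `attribute`.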

The parser pushes the token onto the stack. Then it tries to match the top of the stack with the right side of a production. If there is a match, the matched elements (as many as the right side of the production has) are popped from the stack and replaced by the non-terminal on the left side of the production. And an action is invoked.

This is repeated until all input is matched or a Syntax error is detected.
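The shift/reduce loop can be sketched in a few lines for a toy grammar. The grammar below (Block -> Start Fields End; Fields -> Fields Field | Field) and all names are hypothetical. The sketch also reduces greedily whenever the top of the stack matches a rule, which happens to work for this grammar; generated parsers instead use a parse table and lookahead to decide between shifting and reducing.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Terminals (Start, Field, End) and non-terminals (Fields, Block).
enum class Sym { Start, Field, End, Fields, Block };

struct Rule {
    Sym lhs;               // left side of the production
    std::vector<Sym> rhs;  // right side of the production
};

// Hypothetical toy grammar; rule order decides which rule is tried first.
const std::vector<Rule> rules{
    {Sym::Fields, {Sym::Fields, Sym::Field}},
    {Sym::Fields, {Sym::Field}},
    {Sym::Block, {Sym::Start, Sym::Fields, Sym::End}},
};

// Try to reduce the top of the stack with the first matching rule.
bool reduceOnce(std::vector<Sym>& stack) {
    for (const Rule& rule : rules) {
        if (rule.rhs.size() > stack.size()) continue;
        const auto n = static_cast<std::ptrdiff_t>(rule.rhs.size());
        if (std::equal(rule.rhs.begin(), rule.rhs.end(), stack.end() - n)) {
            stack.erase(stack.end() - n, stack.end());  // pop the right side
            stack.push_back(rule.lhs);                  // push the left side
            return true;
        }
    }
    return false;
}

// Accept the input if it reduces to a single Block.
bool accepts(const std::vector<Sym>& input) {
    std::vector<Sym> stack;
    for (Sym s : input) {
        stack.push_back(s);           // shift
        while (reduceOnce(stack)) {}  // reduce as long as a rule matches
    }
    return stack == std::vector<Sym>{Sym::Block};
}
```

For the input Start Field Field End, the stack goes through [Start], [Start Fields], [Start Fields] again after the second Field is absorbed, and finally [Block], so the input is accepted.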

I will now show you some (NOT COMPILED, NOT TESTED) pseudo code:

#include <algorithm>    // std::equal
#include <vector>
#include <string>
#include <variant>
#include <functional>
#include <iostream>

// Here we store token types for terminals and non-terminals
enum class TokenType {END, OK, EXPRESSION, START1, END1, START2, END2, INTEGER, DOUBLE, STRING};

struct TokenWIthAttribute {
    // Implicit conversion from TokenType lets a Handle be written as a list of token types
    TokenWIthAttribute(const TokenType &tt) : tokenType(tt) {}
    // Copy operations are left to the compiler so that the attribute is copied as well

    TokenType tokenType{};
    std::variant<int, double, std::string> attribute{};

    bool operator ==(const TokenWIthAttribute& twa) const { return tokenType == twa.tokenType;}
};

using NonTerminal = TokenType;
using Handle = std::vector<TokenWIthAttribute>;
using Action = std::function<TokenWIthAttribute(TokenWIthAttribute&)>;

struct Production {
    NonTerminal     nonTerminal{};  // Left side of production
    Handle          handle{};       // Right side of production
    Action          action;         // Action to take during reduction
};

using Grammar = std::vector<Production>;

TokenWIthAttribute actionEndOK(TokenWIthAttribute& twa) {
    // Do something with twa
    return twa;
}

Grammar grammar{
    { TokenType::OK, {TokenType::START1, TokenType::EXPRESSION, TokenType::END1, TokenType::END},actionEndOK}
    // Many lines of more productions
};

using ParseStack = std::vector<TokenWIthAttribute>;

class Parser
{
public:
    bool parse(std::istream &is);
protected:
    TokenWIthAttribute getToken(std::istream &is);
    void shift(TokenWIthAttribute& twa) { parseStack.push_back(twa); }
    bool matchAndReduce();

    ParseStack parseStack;
};


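bool Parser::matchAndReduce()
{
    // Iterate over all productions in the grammar
    for (const Production& production : grammar) {
        const auto handleSize = static_cast<Handle::difference_type>(production.handle.size());
        // Skip empty handles and handles that are longer than the stack
        if (handleSize == 0 || production.handle.size() > parseStack.size()) continue;
        // Match the top of the stack with the right side of the production
        if (std::equal(production.handle.begin(), production.handle.end(), parseStack.cend() - handleSize)) {
            // Call the action while the matched tokens are still on the stack;
            // with this Action signature it sees only the topmost one
            TokenWIthAttribute reduced = production.action(parseStack.back());
            // The result represents the left side of the production
            reduced.tokenType = production.nonTerminal;
            // Reduce: replace the right side of the production with the left side.
            // erase() is used because resize() requires a default constructor
            parseStack.erase(parseStack.end() - handleSize, parseStack.end());
            parseStack.push_back(reduced);
            return true;
        }
    }
    return false;
}

int main()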
{
    std::cout << "Hello World\n";
    return 0;
}

I hope this gives you a first impression.
