How to parse a string into std::map and validate its format?

Question

I'd like to parse a string like "{{0, 1}, {2, 3}}" into a std::map. I can write a small function for parsing a string using <regex> library, but I have no idea how to check whether a given string is in a valid format. How can I validate the format of a string?

#include <list>
#include <map>
#include <regex>
#include <string>
#include <iostream>

void f(const std::string& s) {
  std::map<int, int> m;
  std::regex p {"[\\[\\{\\(](\\d+),\\s*(\\d+)[\\)\\}\\]]"};
  auto begin = std::sregex_iterator(s.begin(), s.end(), p);
  auto end = std::sregex_iterator();
  for (auto x = begin; x != end; ++x) {
    std::cout << x->str() << '\n';
    m[std::stoi(x->str(1))] = std::stoi(x->str(2));
  }
  std::cout << m.size() << '\n';
}

int main() {
  std::list<std::string> l {
    "{{0, 1},   (2,    3)}",
    "{{4,  5, {6, 7}}" // Ill-formed, so need to throw an exception.
  };
  for (auto x : l) {
    f(x);
  }
}

NOTE: I don't feel obliged to use regex to solve this problem. Any kind of solutions, including some ways validating and inserting at once by subtracting substrings, will be appreciated.

What should happen with the second string? You just insert in the map `{6, 7}` or you skip the whole string? — SilvanoCerza, Jun 20 '19 at 08:39
["Now they have two problems"](https://softwareengineering.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems). — n. 'pronouns' m., Jun 20 '19 at 09:56
@SilvanoCerza I don't want the second string to be inserted. Skipping the whole string or generating any kind of errors would be enough. — Han, Jun 20 '19 at 10:55
@SilvanoCerza Thank you for your code. Your code seems to do exactly what I need, but can I just wait for other answers? — Han, Jun 20 '19 at 11:22
Random tip, not quite related to your question, but you can use raw string literals for your regex and avoid escaping every backslash with another `\`. I.e. `R"(\d)"` is equivalent to `"\\d"`. — Charlie, Jun 21 '19 at 04:21
@LightnessRacesinOrbit I'm sorry, but I don't have a clear idea of what kinds of formats to validate. I thought of strings composed of irregularly formed elements like `"{{4, 5, {6, 7}}"`, or those with unmatched braces like `"{{ {1, 2}, {3, 4} }"`. As I'd like to get the strings from users and block anything that seems insane, strings like `"{{2, 6}{}}"` in @Aleph0's code might be another example. — Han, Jun 21 '19 at 17:33
You will need to decide on that before you (or we!) can implement the functionality. — Lightness Races in Orbit, Jun 21 '19 at 17:38

score 3 · Answer 1 · answered Jun 20 '19 at 11:21

It might be a little too much, but if you have Boost at hand, you can use Boost.Spirit to do the job for you. One advantage is that the solution is easily extended to parse other kinds of maps, such as std::map<std::string, int>.

Another advantage that shouldn't be underestimated is that Boost.Spirit leaves you with sane exceptions in case the string doesn't satisfy your grammar. It is quite hard to achieve this with a hand-written solution.

Boost.Spirit also reports the place where the error occurs, so that you can backtrack to it.

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/phoenix_stl.hpp>
#include <boost/fusion/adapted/std_pair.hpp>

template <typename Iterator, typename Skipper>
struct mapLiteral : boost::spirit::qi::grammar<Iterator, std::map<int,int>(), Skipper>
{
    mapLiteral() : mapLiteral::base_type(map)
    {
        namespace qi = boost::spirit::qi;
        using qi::lit;

        map = (lit("{") >> pair >> *(lit(",") >> pair) >> lit("}"))|(lit("{") >> lit("}"));
        pair = (lit("{") >> boost::spirit::int_ >> lit(",") >> boost::spirit::int_ >> lit("}"));
    }

    boost::spirit::qi::rule<Iterator, std::map<int, int>(), Skipper> map;
    boost::spirit::qi::rule<Iterator, std::pair<int, int>(), Skipper> pair;
};

std::map<int,int> parse(const std::string& expression, bool& ok)
{
    std::map<int, int>  result;
    try {
        std::string formula = expression;
        boost::spirit::qi::space_type space;
        mapLiteral<std::string::const_iterator, decltype(space)> parser;
        // Use const iterators so that the iterator type matches the grammar's template argument
        auto b = formula.cbegin();
        auto e = formula.cend();
        ok = boost::spirit::qi::phrase_parse(b, e, parser, space, result);
        if (b != e) {
            // Unparsed trailing input means the string is not a valid map literal
            ok = false;
            return std::map<int, int>();
        }
        return result;
    }
    catch (const boost::spirit::qi::expectation_failure<std::string::const_iterator>&) {
        ok = false;
        return result;
    }
}


int main(int argc, char** args)
{
    std::vector<std::pair<std::map<int, int>,std::string>> tests = {
        {{ },"{  \t\n}"},
        {{{5,2},{2,1}},"{ {5,2},{2,1} }"},
        {{},"{{2, 6}{}}"} // Bad food
    };
    for (auto iter :tests)
    {
        bool ok;
        auto result = parse(iter.second, ok);
        if (result == iter.first)
        {
            std::cout << "Equal:" << std::endl;
        }
    }
}

Armin Montigny · Answer 2 · 2019-06-21T18:27:57.870

Since Han mentioned in his comments that he would like to wait for further ideas, I will show an additional solution.

And as everybody before, I think it is the most appropriate solution :-)

Additionally, I will unpack the "big hammer" and talk about "languages" and "grammars" and, uh oh, the Chomsky hierarchy.

First a very simple answer: pure regular expressions cannot count. So they cannot check for matching braces, such as 3 opening braces and 3 closing braces.

They are mostly implemented as DFA (Deterministic Finite Automaton), also known as FSA (Finite State Automaton). One of the relevant properties here is that they do know only about their current state. They cannot "remember" previous states. They have no memory.

The languages that they can produce are so-called "regular languages". In the Chomsky hierarchy, the grammar to produce such a regular language is of Type-3. And “regular expressions” can be used to produce such languages.

However, there are extensions to regular expressions that can also be used to match balanced braces. See here: Regular expression to match balanced parentheses

But these are not regular expressions as per the original definition.

What we really need, is a Chomsky-Type-2 grammar. A so-called context-free-grammar. And this will usually be implemented with a pushdown-automaton. A stack is used to store additional state. This is the “memory” that regular expressions do not have.

So, if we want to check the syntax of a given expression, as in your case the input for a std::map, we can define an ultra-simple Grammar and parse the input string using the standard classical approach: A Shift/Reduce Parser.

There are several steps necessary. First, the input stream is split into lexemes or tokens. This is usually done by a so-called lexer or scanner. You will always find a function like getNextToken or similar. Then the tokens are shifted onto the stack. The top of the stack is matched against the productions in the grammar. If there is a match with the right side of a production, the matching elements on the stack are replaced by the non-terminal on the left side of that production. This procedure is repeated until the start symbol of the grammar is reached (meaning everything was OK) or a syntax error is found.

Regarding your question:

How to parse a string into std::map and validate its format?

I would split it into 2 tasks:

1. Parse the string to validate the format
2. If the string is valid, put the data into a map

Task 2 is simple and typically a one-liner using a std::istream_iterator.

Task 1 unfortunately needs a shift-reduce-parser. This is a little bit complex.

In the code below, I show one possible solution. Please note: this can of course be optimized by using tokens with attributes. The attributes would be the integer value and the type of the brace. The tokens with attributes would be stored on the parse stack. With that, we could eliminate the need for separate productions for each kind of brace, and we could fill the map in the parser (in the reduction operation of a production such as “{Token::Pair, { Token::B1open, Token::Integer, Token::Comma, Token::Integer, Token::B1close} }”).

Please see the code below:

#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <map>
#include <vector>
#include <algorithm>
#include <cctype>
#include <utility>

// Tokens:  Terminals and Non-Terminals
enum class Token { Pair, PairList, End, OK, Integer, Comma, B1open, B1close, B2open, B2close, B3open, B3close };

// Production type for Grammar
struct Production { Token nonTerminal; std::vector<Token> rightSide; };

// The Context Free Grammar CFG
std::vector<Production>    grammar
{
       {Token::OK, { Token::B1open, Token::PairList, Token::B1close } },
       {Token::OK, { Token::B2open, Token::PairList, Token::B2close } },
       {Token::OK, { Token::B3open, Token::PairList, Token::B3close } },
       {Token::PairList, { Token::PairList, Token::Comma, Token::Pair}    },
       {Token::PairList, { Token::Pair } },
       {Token::Pair, { Token::B1open, Token::Integer, Token::Comma, Token::Integer, Token::B1close} },
       {Token::Pair, { Token::B2open, Token::Integer, Token::Comma, Token::Integer, Token::B2close} },
       {Token::Pair, { Token::B3open, Token::Integer, Token::Comma, Token::Integer, Token::B3close} }
};
// Helper for translating brace characters to Tokens
std::map<const char, Token> braceToToken{
 {'(',Token::B1open},{'[',Token::B2open},{'{',Token::B3open},{')',Token::B1close},{']',Token::B2close},{'}',Token::B3close},
};

// A classical    SHIFT - REDUCE  Parser
class Parser
{
public:
    Parser() : parseString(), parseStringPos(parseString.begin()) {}
    bool parse(const std::string& inputString);
protected:
    // String to be parsed
    std::string parseString{}; std::string::iterator parseStringPos{}; // Iterator for input string

    // The parse stack for the Shift Reduce Parser
    std::vector<Token> parseStack{};

    // Parser Step 1:   LEXER    (lexical analysis / scanner)
    Token getNextToken();
    // Parser Step 2:   SHIFT
    void shift(Token token) { parseStack.push_back(token); }
    // Parser Step 3:   MATCH / REDUCE
    bool matchAndReduce();
};

bool Parser::parse(const std::string& inputString)
{
    parseString = inputString; parseStringPos = parseString.begin(); parseStack.clear();
    Token token{ Token::End };
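    do   // Read tokens until end of string
    {
        token = getNextToken();     // Parser Step 1:   LEXER    (lexical analysis / scanner)
        shift(token);               // Parser Step 2:   SHIFT
        while (matchAndReduce())    // Parser Step 3:   MATCH / REDUCE
            ; // Empty body
    } while (token != Token::End);  // Do until end of string reached
    // Valid only if everything was reduced to the start symbol: the stack must be exactly { OK, End }.
    // Checking parseStack[0] alone would also accept leftovers after a valid prefix, e.g. "({10, 1})(".
    return (parseStack.size() == 2 && parseStack[0] == Token::OK);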
}

Token Parser::getNextToken()
{
    Token token{ Token::End };
    // Eat all white spaces
    while ((parseStringPos != parseString.end()) && std::isspace(static_cast<int>(*parseStringPos))) {
        ++parseStringPos;
    }
    // Check for end of string
    if (parseStringPos == parseString.end()) {
        token = Token::End;
    }
    // Handle digits
    else if (std::isdigit(static_cast<int>(*parseStringPos))) {
        while ((((parseStringPos + 1) != parseString.end()) && std::isdigit(static_cast<int>(*(parseStringPos + 1)))))        ++parseStringPos;
        token = Token::Integer;
    }
    // Detect a comma
    else if (*parseStringPos == ',') {
        token = Token::Comma;
        // Else search for all kind of braces
    }
    else {
        std::map<const char, Token>::iterator foundBrace = braceToToken.find(*parseStringPos);
        if (foundBrace != braceToToken.end()) token = foundBrace->second;
    }
    // In next function invocation the next string element will be checked
    if (parseStringPos != parseString.end())
        ++parseStringPos;

    return token;
}


bool Parser::matchAndReduce()
{
    bool result{ false };
    // Iterate over all productions in the grammar
    for (const Production& production : grammar) {
        if (production.rightSide.size() <= parseStack.size()) {
            // If enough elements on the stack, match the top of the stack with a production
            if (std::equal(production.rightSide.begin(), production.rightSide.end(), parseStack.end() - production.rightSide.size())) {
                // Found production: Reduce
                parseStack.resize(parseStack.size() - production.rightSide.size());
                // Replace right side of production with left side
                parseStack.push_back(production.nonTerminal);
                result = true;
                break;
            }
        }
    }
    return result;
}

using IntMap = std::map<int, int>;
using IntPair = std::pair<int, int>;

namespace std {
    istream& operator >> (istream& is, IntPair& intPair)    {
        return is >> intPair.first >> intPair.second;
    }
    ostream& operator << (ostream& os, const pair<const int, int>& intPair) {
        return os << intPair.first << " --> " << intPair.second;
    }
}

int main()
{   // Test Data. Test Vector with different strings to test
    std::vector <std::string> testVector{
        "({10, 1 1},   (2,  3) , [5 ,6])",
        "({10, 1},   (2,  3) , [5 ,6])",
        "({10, 1})",
        "{10,1}"
    };
    // Define the Parser
    Parser parser{};
    for (std::string& test : testVector)
    {   // Give some nice info to the user
        std::cout << "\nChecking '" << test << "'\n";
        // Parse the test string and test, if it is valid
        bool inputStringIsValid = parser.parse(test);
        if (inputStringIsValid) {               // String is valid. Delete everything but digits
            std::replace_if(test.begin(), test.end(), [](const char c) {return !std::isdigit(static_cast<int>(c)); }, ' ');
            std::istringstream iss(test);       // Copy string with digits into an istringstream, so that we can read with istream_iterator
            IntMap intMap{ std::istream_iterator<IntPair>(iss),std::istream_iterator<IntPair>() };
            // Present the resulting data in the map to the user
            std::copy(intMap.begin(), intMap.end(), std::ostream_iterator<IntPair>(std::cout, "\n"));
        } else {
            std::cerr << "***** Invalid input data\n";
        }
    }
    return 0;
}

I hope this is not too complex. But it is the "mathematically" correct solution. Have fun . . .

Thank you for your comprehensive explanation of formal languages. Your code helped me a lot to understand. — Han, Jun 23 '19 at 16:39

score 3 · Accepted Answer · answered Jun 21 '19 at 13:17

In my opinion, a Spirit-based parser is always much more robust and readable. It is also much more fun to parse with Spirit :-). So, in addition to @Aleph0's answer, I'd like to provide a compact solution based on Spirit X3:

#include <string>
#include <map>
#include <iostream>
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/home/x3.hpp>

int main() {
    std::string input ="{{0, 1},  {2, 3}}";
    using namespace boost::spirit::x3;
    const auto pair = '{' > int_ > ',' > int_ > '}';
    const auto pairs = '{' > (pair % ',')  > '}';
    std::map<int, int> output;
    // ignore spaces, tabs, newlines
    phrase_parse(input.begin(), input.end(), pairs, space, output);

    for (const auto [key, value] : output) {
        std::cout << key << ":" << value << std::endl;
    }
}

Note that I used operator >, which means "expect". So, if the input does not match the expectation, Spirit throws an exception. If you prefer a silent failure, use operator >> instead.

SilvanoCerza · Answer 4 · 2019-06-20T11:01:18.160

You can validate your strings by checking just the parentheses, like so. This is not extremely efficient, since it always iterates over the whole string, but it can be optimized.

#include <list>
#include <iostream>
#include <string>

bool validate(std::string s)
{
    std::list<char> parens;
    for (auto c : s) {
        if (c == '(' || c == '[' || c == '{') {
            parens.push_back(c);
        }

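        if (c == ')' || c == ']' || c == '}') {
            const char expected = (c == ')') ? '(' : (c == ']') ? '[' : '{';
            if (!parens.empty() && parens.back() == expected) {
                parens.pop_back();
            } else {
                // Unmatched closer: keep it on the list so that the final size check fails.
                // The empty() test also avoids calling back() on an empty list (undefined behaviour).
                parens.push_back(c);
            }
        }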
    }
    return parens.size() == 0;
}


int main()
{
  std::list<std::string> l {
    "{{0, 1},   (2,    3)}",
    "{{4,  5, {6, 7}}" // Ill-formed, so need to throw an exception.
  };

  for (auto s : l) {
      std::cout << "'" << s << "' is " << (validate(s) ? "" : "not ") << "valid" << std::endl;
  }

  return 0;
}

The output of the above code is this:

'{{0, 1},   (2,    3)}' is valid
'{{4,  5, {6, 7}}' is not valid

EDIT:

This version should be more efficient since it returns right after it notices a string is not valid.

bool validate(std::string s)
{
    std::list<char> parens;
    for (auto c : s) {
        if (c == '(' || c == '[' || c == '{') {
            parens.push_back(c);
        }

        if (c == ')' || c == ']' || c == '}') {
            // A closer with nothing open, or with the wrong opener on top, is invalid.
            // The empty() test also avoids calling back() on an empty list (undefined behaviour).
            const char expected = (c == ')') ? '(' : (c == ']') ? '[' : '{';
            if (parens.empty() || parens.back() != expected) {
                return false;
            }
            parens.pop_back();
        }
    }
    return parens.size() == 0;
}

I'd probably use a stack rather than a list but yeah — Lightness Races in Orbit, Jun 21 '19 at 14:13

gimme_danger · Answer 5 · 2019-06-20T09:32:43.207

Your regex parses a single map element perfectly. I suggest validating the string before creating the map and filling it with the parsed elements.

Let's use a slightly improved version of your regex:

[\[\{\(](([\[\{\(](\d+),(\s*)(\d+)[\)\}\]])(,?)(\s*))*[\)\}\]]

It matches the whole string if it is valid: it begins with [\[\{\(], ends with [\)\}\]], and contains zero or more map elements in between, each followed by an optional comma and zero or more whitespace characters.

Here is the code:

#include <list>
#include <map>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <iostream>

void f(const std::string& s) {
  // part 1: validate string
  std::regex valid_pattern {"[\\[\\{\\(](([\\[\\{\\(](\\d+),(\\s*)(\\d+)[\\)\\}\\]])(,?)(\\s*))*[\\)\\}\\]]"};
  auto valid_begin = std::sregex_iterator(s.begin(), s.end(), valid_pattern);
  auto valid_end = std::sregex_iterator();
  if (valid_begin == valid_end || valid_begin->str().size() != s.size ()) {
    std::stringstream res;
    res << "String \"" << s << "\" doesn't satisfy pattern!";
    throw std::invalid_argument (res.str ());
  } else {
    std::cout << "String \"" << s << "\" satisfies pattern!" << std::endl;
  }

  // part 2: parse map elements
  std::map<int, int> m;
  std::regex pattern {"[\\[\\{\\(](\\d+),\\s*(\\d+)[\\)\\}\\]]"};
  auto parsed_begin = std::sregex_iterator(s.begin(), s.end(), pattern);
  auto parsed_end = std::sregex_iterator();
  for (auto x = parsed_begin; x != parsed_end; ++x) {
    m[std::stoi(x->str(1))] = std::stoi(x->str(2));
  }

  std::cout << "Number of parsed elements: " << m.size() << '\n';
}

int main() {
  std::list<std::string> l {
      "{}",
      "[]",
      "{{0, 153}, (2, 3)}",
      "{{0,      153},   (2,    3)}",
      "{[0, 153],           (2, 3), [154, 33]   }",
      "{[0, 153],           (2, 3), [154, 33]   ", // Ill-formed, so need to throw an exception.
      "{{4,  5, {6, 7}}", // Ill-formed, so need to throw an exception.
      "{{4,  5, {x, 7}}" // Ill-formed, so need to throw an exception.
  };
  for (const auto &x : l) {
    try {
      f(x);
    }
    catch (std::invalid_argument &ex) {
      std::cout << ex.what () << std::endl;
    }
    std::cout << std::endl;
  }
}

Here is the output:

String "{}" satisfies pattern!
Number of parsed elements: 0

String "[]" satisfies pattern!
Number of parsed elements: 0

String "{{0, 153}, (2, 3)}" satisfies pattern!
Number of parsed elements: 2

String "{{0,      153},   (2,    3)}" satisfies pattern!
Number of parsed elements: 2

String "{[0, 153],           (2, 3), [154, 33]   }" satisfies pattern!
Number of parsed elements: 3

String "{[0, 153],           (2, 3), [154, 33]   " doesn't satisfy pattern!

String "{{4,  5, {6, 7}}" doesn't satisfy pattern!

String "{{4,  5, {x, 7}}" doesn't satisfy pattern!

PS It has one known defect. It doesn't check that each closing bracket corresponds to its opening one, so it also matches strings such as {] or {(1,2]). If that is not okay for you, the easiest way to fix it is to add some extra validation code before putting each parsed pair into the map.

PPS If you can avoid regexes, the problem can be solved much more efficiently with a single scan of each string. @SilvanoCerza proposed an implementation of this approach.
