
Context

I am using Boost Spirit X3 to generate HTML from Markdown files like this:

// simplified code structure
auto str = x3::lexeme[+(x3::char_ - x3::eol)];

auto h1_action = [&](auto& ctx) { /* generate output */ };
auto h1 = ("# " > str)[h1_action];

auto p_action = [&](auto& ctx) { /* generate output */ };
auto p = (+(str > x3::eol))[p_action];

auto markdown = (h1 | p) % *x3::eol;

x3::phrase_parse(begin, end, markdown, x3::blank);

Problem 1: Symbol Tables

I use symbol tables to recognize escaped characters:

x3::symbols<char> esc;
esc.add
    ("\\(", '(')
    ("\\)", ')')
    /* ... */;

auto dummy_action = [&](auto& ctx) {
    auto value = x3::_attr(ctx);
};

auto esc_str = (x3::lexeme[+(esc | (x3::char_ - x3::eol))])[dummy_action];

The variable value in the lambda dummy_action above is of type std::vector<boost::variant<char, char>>, whereas according to the Compound Attribute Rules it should be a std::vector<char>, and thus a mere string.

a: A, b: B --> (a | b): variant<A, B>
a: A, b: A --> (a | b): A

These rules state that if both operands of an alternative parser (in the parser esc_str above, that would be esc and (x3::char_ - x3::eol)) have the same attribute type (char in this case), the result is not a variant. But it is, and I would like to understand why.

Problem 2: Character Parser

I want to parse a string in parentheses followed by an x3::eol, where the string itself can also contain arbitrary parentheses, e.g. (path/to/img(2).png)\n.

auto paran = x3::char_(')') >> !x3::eol;
auto ch = x3::char_ - (x3::eol | ')');
auto src = (x3::lexeme[x3::lit('(') >> +(paran | ch) >> ')'])[dummy_action];

Same problem: I expect the attribute of src to be of type string, since the two arguments paran and ch of the alternative parser (paran | ch) have the same attribute, but the attribute of src is std::vector<boost::variant<char, char>>.

Conclusion

Obviously I can simply transform the result into a string, but does anyone have an alternative way to parse the above examples and directly receive a string as the result, or an explanation as to why the result is a vector of variant<char, char>? (Why would you want a variant that contains the exact same type more than once?)

Thanks in advance for any help!

Edit

Markdown Reference: I am using this markdown reference, which does not specify an underlying grammar, so additionally, I use the built-in markdown preview feature of Visual Studio Code to analyse edge cases. I am aware all parsers in my post are not correct or at least incomplete (e.g. empty headers are not recognized).

Semantic Actions: I am aware that separation of concerns would be a better approach, i.e. generating an AST/DOM first and then generating the output from that. I am just unable to do so, and since I only want to parse a very limited subset of Markdown, I settled for the semantic action approach.

melina

1 Answer


I agree that the attribute synthesis can be ... surprising.

The root cause seems to be an insistence on using semantic actions on raw parser expressions. The usual way to approach this would be something like:

auto str
    = x3::rule<struct str_, std::string>{"str"}
    = x3::lexeme[+(esc | (x3::char_ - x3::eol))];

and then, if you must, attach a semantic action to str in the higher-level rule (which you basically have, in p_action).

Second Problem

Seems to be very much the same as the first.

Same problem: I expect the attribute of src to be of type string, since the two arguments paran and ch of the alternative parser (paran | ch)

If you expect the attribute of src to be a specific type, you should probably declare it as such:

auto eol = x3::eol | x3::eoi;
auto paran
    //= x3::rule<struct last_paren_, std::string> {"paran"} // aids in debug output
    = char_(')') >> !eol;
auto src
    = x3::rule<struct src_, std::string>{"src"}
    = lexeme['(' >> *(~char_("\r\n \t\b\f)") | paran) >> ')'];

Observations

  1. I question that the grammar is correct. In particular, the way you currently specify how parentheses can be embedded inside hyperlink source specs doesn't match the Markdown engines I'm aware of.

    Including StackOverflow's, as you can see. There was no need for the last parenthesis to be at EOL.

    If you have a reference to the particular Markdown specification you're trying to implement, I'd be happy to evaluate further.

  2. Also, I'm not convinced that the approach of doing all the heavy lifting in semantic actions (see Boost Spirit: "Semantic actions are evil"?) is a helpful one.

    In general I'd advise separating the concerns of parsing and output generation. This will make it much easier to

    • achieve/test correctness
    • maintain the parser
    • change the way output is generated from a parsed document

Full Demo

Here's a demo tying things together, while still keeping with your current approach, emphasizing semantic actions.

Hopefully the improvements and ideas shown help. Especially the conditionally enabled rule debugging could be a big productivity accelerator while you're learning or maintaining your grammar.

Live On Compiler Explorer

#define BOOST_SPIRIT_X3_DEBUG
#include <boost/spirit/home/x3.hpp>
#include <fstream>
#include <iostream>
#include <boost/core/demangle.hpp>

namespace x3 = boost::spirit::x3;
namespace Parser {
    using x3::char_;
    using x3::lit;
    using x3::lexeme;
    // simplified code structure
#if 0
    auto str = lexeme[+(char_ - x3::eol)];
#else
    auto esc = [] {
        x3::symbols<char> esc;
        esc.add
            ("\\(", '(')("\\)", ')')
            ("\\[", '[')("\\]", ']')
        /* ... */;
        return esc;
    }();

    auto eol = x3::eol | x3::eoi;
    auto paran
        //= x3::rule<struct last_paren_, std::string> {"paran"} // aids in debug output
        = char_(')') >> !eol;
    auto src
        = x3::rule<struct src_, std::string>{"src"}
        = lexeme['(' >> *(~char_("\r\n \t\b\f)") | paran) >> ')'];

    auto hyperlink 
        = x3::rule<struct hyperlink_, std::string>{"hyperlink"}
        = '[' >> *(esc | ~char_("\r\n]")) >> ']' >> src;

    auto str
        = x3::rule<struct str_, std::string>{"str"}
        = lexeme[
            +( esc
            | &lit('[') >> hyperlink  // the &lit suppresses verbose debug
            | (char_ - x3::eol)
            )];
#endif

    auto h1_action = [](auto &) { /* generate output */ };
    auto h1
        = x3::rule<struct h1_, std::string> {"h1"}
        = ("# " > str)[h1_action]
        ;

    auto p_action = [](auto &) { /* generate output */ };
    auto p
        = x3::rule<struct p_, std::string> {"p"}
        = (+(str > eol))[p_action];

    auto content
        = x3::rule<struct lines_, std::string> {"content"}
        = (h1 | p) % +x3::eol;

    auto markdown = x3::skip(x3::blank)[*x3::eol >> content];
} // namespace Parser

int main() {
#if 0
    std::ifstream ifs("input.txt");
    std::string const s(std::istreambuf_iterator<char>(ifs), {});
#else
    std::string const s = R"(
# Frist

This [introduction](https://en.wikipedia.org/wiki/Wikipedia:Introduction_(historical))
serves no purpose. Other than to show some [[hyper\]links](path/to/img(2).png)
)";
#endif

    parse(begin(s), end(s), Parser::markdown);
}

Prints debug output:

<content>
  <try># Frist\n\n    This [i</try>
  <h1>
    <try># Frist\n\n    This [i</try>
    <str>
      <try>Frist\n\n    This [int</try>
      <success>\n\n    This [introduc</success>
      <attributes>[F, r, i, s, t]</attributes>
    </str>
    <success>\n\n    This [introduc</success>
  </h1>
  <h1>
    <try>This [introduction](</try>
    <fail/>
  </h1>
  <p>
    <try>This [introduction](</try>
    <str>
      <try>This [introduction](</try>
      <hyperlink>
        <try>[introduction](https</try>
        <src>
          <try>(https://en.wikipedi</try>
          <success>\n    serves no purpo</success>
          <attributes>[h, t, t, p, s, :, /, /, e, n, ., w, i, k, i, p, e, d, i, a, ., o, r, g, /, w, i, k, i, /, W, i, k, i, p, e, d, i, a, :, I, n, t, r, o, d, u, c, t, i, o, n, _, (, h, i, s, t, o, r, i, c, a, l, )]</attributes>
        </src>
        <success>\n    serves no purpo</success>
        <attributes>[i, n, t, r, o, d, u, c, t, i, o, n, h, t, t, p, s, :, /, /, e, n, ., w, i, k, i, p, e, d, i, a, ., o, r, g, /, w, i, k, i, /, W, i, k, i, p, e, d, i, a, :, I, n, t, r, o, d, u, c, t, i, o, n, _, (, h, i, s, t, o, r, i, c, a, l, )]</attributes>
      </hyperlink>
      <success>\n    serves no purpo</success>
      <attributes>[T, h, i, s,  , i, n, t, r, o, d, u, c, t, i, o, n, h, t, t, p, s, :, /, /, e, n, ., w, i, k, i, p, e, d, i, a, ., o, r, g, /, w, i, k, i, /, W, i, k, i, p, e, d, i, a, :, I, n, t, r, o, d, u, c, t, i, o, n, _, (, h, i, s, t, o, r, i, c, a, l, )]</attributes>
    </str>
    <str>
      <try>    serves no purpos</try>
      <hyperlink>
        <try>[[hyper\]links](path</try>
        <src>
          <try>(path/to/img(2).png)</try>
          <success>\n    </success>
          <attributes>[p, a, t, h, /, t, o, /, i, m, g, (, 2, ), ., p, n, g]</attributes>
        </src>
        <success>\n    </success>
        <attributes>[[, h, y, p, e, r, ], l, i, n, k, s, p, a, t, h, /, t, o, /, i, m, g, (, 2, ), ., p, n, g]</attributes>
      </hyperlink>
      <success>\n    </success>
      <attributes>[s, e, r, v, e, s,  , n, o,  , p, u, r, p, o, s, e, .,  , O, t, h, e, r,  , t, h, a, n,  , t, o,  , s, h, o, w,  , s, o, m, e,  , [, h, y, p, e, r, ], l, i, n, k, s, p, a, t, h, /, t, o, /, i, m, g, (, 2, ), ., p, n, g]</attributes>
    </str>
    <str>
      <try>    </try>
      <fail/>
    </str>
    <success>    </success>
  </p>
  <success>    </success>
</content>
sehe
  • Thanks for taking the time to write up such an in-depth answer. It improved my understanding of Spirit X3 and solved my problem. From [the documentation](https://ciere.com/cppnow15/x3_docs/) I did not quite understand the purpose of rules. One question remains: In the `str` parser, is it really necessary to have `&x3::lit('[') >> hyperlink`? If you omit the look-ahead (change it to just `hyperlink`), won't the parser be the same, since it tries to match the hyperlink first, but immediately fails when the first char is not a `[` and then continues to the next alternative? – melina Mar 08 '21 at 22:19
  • The comment (_`// the &lit suppresses verbose debug`_) was added to indicate why I added the redundant `&x3::lit('[')` lookahead, yes. It's just to make debug output better. – sehe Mar 08 '21 at 22:36