3

I'm writing a parser for the first time. I'm following this tutorial on Pratt parers. I've got it to work, but I've come up with sort of a problem.

The original tutorial is written in Java. I prefer C++, so that's what I wrote mine with. I was able to basically port most of the code to C++ (although, I did make it "mine" in the sense that there are some non-language related differences). The only real problem I have is with this line of code:

public Expression parse(Parser parser, Token token) {
   Expression operand = parser.parseExpression();
?    return new PrefixExpression(token.getType(), operand);

This works fine in Java (I'm assuming. I've never really worked with Java before, but I assume the guy knows what he's doing), but in C++ not so much. I was able to accomplish the same thing by using pointers like so:

Expression* parse(Parser& parser, Token token) {
    Expression* operand = parser.parseExpression();
    return new PrefixExpression(token.getType(), operand);

Which (although I am unfamiliar with the semantics of Java) seems to do the exact same thing in C++, only with pointers instead of normal objects.

However, the problem with working with pointers like this is that it gets messy kind of fast. Now it become much easier for everything to work with pointers, which means I have to worry about deallocation, and maybe memory leaks if I don't do it right. It just becomes a mess.

Now, the solution seems easy. I could just return PrefixExpression like this:

Expression parse(Parser& parser, Token token) {
    Expression operand = parser.parseExpression();
    return PrefixExpression(token.getType(), operand);

Here's my problem: if I do it like this, I lose the vtable and any extra data in this new Expression. That's a problem since Expression is actually just a base class for many types of expressions. Parse can parse anything it wants to, not just a PrefixExpression. That's how the original was designed. Generally, I like that design, but, as you can see, it's causing problems. Simply returning a new Expression right here loses things I need from that object later.

Now, I can try to solve this by returning a reference:

Expression& parse(Parser& parser, Token token) {
    // ...
    return PrefixExpression(token.getType(), operand);

That solves the vtable and extra data problem, but now that creates a new one. I'm returning a reference to a variable that will be destroyed instantly, which is of no help.

All of this to say, that's why I originally ultimately went with pointers. Pointers let me keep the data I needed later, but they are really hard to work with. I can squeeze by, but personally I'd like something better.

I think I could use std::move, but I'm not familiar with that enough to be certain I'd be using it properly. If I have to I will, but implementing that properly takes some skill and knowledge I just don't have. Besides, that is a lot of work to rework everything I have to work that way up to this point.

All of that to lead to the main point of my question: can I simply just return a reference to a new object safely? Let me just show an example:

Expression& parse(Parser& parser, Token token) {
    //...
    return *(new PrefixExpression(token.getType(), operand));

This would be nice and solve most of my problems because, if it does what I think it does, I get a reference to a new object, keep the vtable and extra data, and it doesn't get destroyed immediately. This would let me have my cake and eat it too.

However, my problem is can I actually do this? While I feel I have a good reason to do this, this to me seems very weird. I'm allocating new data inside a function, and expecting it to get deallocated outside the function automatically like any normal variable. Even if that did work, would that behave as I would expect it to outside this function completely? I am scared that this might be invoking undefined behavior or something like that. What does the standard think of this?

Edit: So here is a requested minimal sample:

Expression:

    // A (not really pure) purely virtual base class that holds all types of expressions
    class Expression {
        protected:
            const std::string type;
        public:
            Expression() : type("default") {}
            virtual ~Expression() {} //Because I'm dealing with pointers, I *think* I need a virtual destructor here. Otherwise, I don't really need 

            virtual operator std::string() {
                // Since I am working with a parser, I want some way to debug and make sure I'm parsing correctly. This was the easiest.
                throw ("ERROR: No conversion to std::string implemented for this expression!");
            }
            // Keep in mind, I may do several other things here, depending on how I want to use Expression
};

A child Expression, for Parenthesis:

    class Paren : public Expression {
        private:
            // Again, Pointer is not my preferred way, but this was just easier, since Parse() was returning a pointer anyway.
            Expression* value;
        public:
            Paren(Expression *e) {
                // I know this is also sketchy. I should be trying to perform a copy here. 
                // However, I'm not sure how to do this, since Expression could be anything.
                // I just decided to write my code so the new object takes ownership of the  pointer. I could and should do better 
                value = e;
            }

            virtual operator std::string() {
                return "(" + std::string(*value) + ")";
            }

            // Because again, I'm working with pointers
            ~Paren() {delete value;}
    };

And a parser:

class Parser {
    private:
        Grammar::Grammar grammar;
    public:
        // this is just a function that creates a unique identifier for each token.
        // Tokens normally have types identifier, number, or symbol.
        // This would work, except I'd like to make grammar rules based off
        // the type of symbol, not all symbols in general
        std::string GetMapKey(Tokenizer::Token token) {
                if(token.type == "symbol") return token.value;
                return token.type;
        }
        // the parsing function
        Expression * parseExpression(double precedence = 0) {
            // the current token
            Token token = consume();

                // detect and throw an error here if we have no such prefix
                if(!grammar.HasPrefix(GetMapKey(token))) {
                    throw("Error! Invalid grammar! No such prefix operator.");
                }

                // get a prefix parselet 
                Grammar::PrefixCallback preParse = grammar.GetPrefixCallback(GetMapKey(token));

                // get the left side
                Expression * left = preParse(token,*this);

                token = peek();

                double debug = peekPrecedence();

                while(precedence < peekPrecedence() && grammar.HasInfix(GetMapKey(token))) {
                    // we peeked the token, now we should consume it, now that we know there are no errors
                    token = consume();

                    // get the infix parser
                    Grammar::InfixCallback inParse = grammar.GetInfixCallback(GetMapKey(token));


                    // and get the in-parsed token
                    left = inParse(token,left,*this);
                }

                return left;
            }

After I posted the parser code, I realized I should mention that I put all the grammar related stuff into its own class. It just has some nice utilities related to grammar, as well as allows us to write a grammar independent parser and worry about the grammar later:

    class Grammar {
        public:
            // I'm in visual studio 2010, which doesn't seem to like the using type = value; syntax, so this instead
            typedef std::function<Expression*(Tokenizer::Token,Parser&)> PrefixCallback;
            typedef std::function<Expression*(Tokenizer::Token, Expression*, Parser&)> InfixCallback;
        private:
            std::map<std::string, PrefixCallback> prefix;
            std::map<std::string, InfixCallback> infix;
            std::map<std::string, double> infixPrecedence; // we'll use double precedence for more flexabillaty
        public:
            Grammar() {
                prefixBindingPower = std::numeric_limits<double>::max();
            }

            void RegisterPrefix(std::string key, PrefixCallback c) {
                prefix[key] = c;
            }

            PrefixCallback GetPrefixCallback(std::string key) {
                return prefix[key];
            }

            bool HasPrefix(std::string key) {
                return prefix.find(key) != prefix.end();
            }

            void RegisterInfix(std::string key, InfixCallback c, double p) {
                infix[key] = c;
                infixPrecedence[key] = p;
            }

            InfixCallback GetInfixCallback(std::string key) {
                return infix[key];
            }

            double GetInfixPrecedence(std::string key) {
                return infixPrecedence[key];
            }

            bool HasInfix(std::string key) {
                return infix.find(key) != infix.end();
            }
    };

Finally, I probably need to show a parsing callback to complete the set:

    Expression* ParenPrefixParselet(Tokenizer::Token token, Parser& parser) {
        Expression* value = parser.parseExpression(0);
        Expression* parenthesis = new Paren(value); // control of value gets given to  our new expression. No need to delete
        parser.consume(")");

        return parenthesis;
    }

That allows me to write a grammar that allows for things in parenthesis like this:

Grammar g;
g.RegisterPrefix("(", &ParenPrefixParselet);

Finally, a main():

int main() {
    Grammar g;
    g.RegisterPrefix("(", &ParenPrefixParselet);
    Parser parser(g);

    Expression* e = parser.parseExpression(0);

    std::cout << static_cast<std::string>(*e);

    return 0;
}

Believe it or not, I think that's pretty minimal. Remember, this is a parser. Keep in mind, that as a minimal example, I plan on it being expanded, but hopefully you get the idea.

  • It would be impossible to delete that object because you use new and then discard the pointer. But my first impression is not UB, just a memory leak. My first instinct is to just return an object and let c++ take care of it. "return PrefixExpression(token.getType(), operand);" – Kenny Ostrom Jun 17 '19 at 22:29
  • "*Here's my problem: if I do it like this, I lose the vtable and any extra data in this new `Expression`*" - why? Isn't the relevant data consumed by `PrefixExpression` constructor? Are your objects following the [Rule of three](https://stackoverflow.com/questions/4172722/what-is-the-rule-of-three)/five/zero? – Fureeish Jun 17 '19 at 22:30
  • 1
    @Fureeish Because `Expression` is an abstract class. It has no real implementation. `Expression` itself has a virtual destructor, but other than that it follows the rule of 0, I think--unless having a virtual destructor alone violates the rule of 0/3/5 –  Jun 17 '19 at 22:34
  • Might need a factory constructor. Can you make a minimal code example, and we can deal with any issues. If you're keeping a collection of pointers to the abstract base class, you need to construct the actual concrete class. – Kenny Ostrom Jun 17 '19 at 22:35
  • @Fureeish Remember, I'm returning an `Expression`, not a `PrefixExpression`. The upper lever parsing code only sees an expression. If I need something from `PrefixExpression`, I would need to cast it, theoretically. –  Jun 17 '19 at 22:36
  • If `Expression` is an abstract class then `Expression operand = ...` will not compile. I get your original point now, though. I would then suggest using `std::unique_ptr`. – Fureeish Jun 17 '19 at 22:39
  • @Fureeish does `unique_pointer` handle deletion when it just goes out of scope? Like could I return a reference to a `unique_pointer`, and that would give me everything I need? –  Jun 17 '19 at 22:41
  • @KennyOstrom given the last implementation, would it really be impossible to delete that object? What would be wrong with `auto& x = parse(...); delete &x;`? The address of `x` was returned by a `new`, so `delete` should have no problem with it. – Fureeish Jun 17 '19 at 22:41
  • @Fureeish Also, my code does compile, so I must not be explaining something correctly. Why would it not compile? –  Jun 17 '19 at 22:42
  • @KennyOstrom Also, trying to work on that example. –  Jun 17 '19 at 22:42
  • `std::unique_ptr` handles deletion by itself when it goes out of scope, correct. You just return plain `std::unique_ptr`, not a reference to it, though. Regarding the code not compiling - you cannot instantiate an abstract class. You must've used the term incorrectly. Are any of the `Expression` methods pure virtual? If not, it's not an abstract class. – Fureeish Jun 17 '19 at 22:44
  • @Fureeish Ah. Yes. I should mention that I am returning objects that inherit from Expression. You know, now that you mention, `Expression` does actually have an implementation, I just throw an error. So I guess technically, it's not really an abstract class in the technical sense, only the practical sense. Calling a function will throw an error immediately. –  Jun 17 '19 at 22:47
  • @Fureeish And that may solve my problem. I'll look into it. In the mean time, I'll still try to give an example to show more context. –  Jun 17 '19 at 22:48
  • Possible duplicate of [Is there a platform or situation where dereferencing (but not using) a null pointer to make a null reference will behave badly?](https://stackoverflow.com/questions/9383684/is-there-a-platform-or-situation-where-dereferencing-but-not-using-a-null-poin) – Raymond Chen Jun 17 '19 at 22:51
  • @RaymondChen This is not a duplicate. That deals with null pointer. This is about allocating new memory. –  Jun 17 '19 at 22:54
  • @RaymondChen while the linked question and answers provide some quality information regarding the topic, they mainly focus on UB, how compilers treat it and what happens when we try to null a reference. I do not think it's a good duplicate. – Fureeish Jun 17 '19 at 22:54
  • @KennyOstrom The example. –  Jun 17 '19 at 23:30
  • @Chipster: "*This would let me have my cake and eat it too.*" Where's the cake? Your original problem was *not* with pointers or references, but with *lifetime issues*: " I have to worry about deallocation, and maybe memory leaks if I don't do it right." That doesn't go away just because you're using a reference. If you allocate a thing with `new`, you *have to `delete` it*, period. You've changed nothing. – Nicol Bolas Jun 17 '19 at 23:52
  • @NicolBolas Good point. –  Jun 17 '19 at 23:53
  • @NicolBolas I was just hoping that C++ would see this a regular object. When it went out of scope, it's destructor would get called. I'm not really sure what I was thinking now, though, now that you mention it. It is still a reference, and I used new, so it ultimately needs delete. –  Jun 17 '19 at 23:55
  • Ah, I misunderstood. I thought this was looking at dereferencing a potentially failed `new` operation, rather than dereferencing a known-good pointer. It's not clear how you would manage the lifetime of the returned reference. – Raymond Chen Jun 18 '19 at 00:20
  • What a lot of compilers do is just not deallocate the memory. The AST is going to live during the whole program anyways since you need it for every step of compilation. Deallocation would then happen when the compiler is done with everything and at that point you could just let the OS deallocate the memory for you :) – Rakete1111 Jun 18 '19 at 07:18

2 Answers2

4

You wish to use polymorphism - there are two ways. Either use references or pointers. The thing with references is that it's dangerous when you return them. It's most of the time UB when you return a reference to a local object. That means we're left with pointers.

But don't use new and delete. They are unsafe, hard to deal with, especially in multi-scope enviroment. Use a smart pointer. Use a unique_ptr:

#include <memory>

struct expression {
    virtual void foo() = 0;
    virtual ~expression() = default;
};

struct prefix_expression : expression {
    virtual void foo() { /* default impl */ }

    // dummy c-tor
    prefix_expression(int) {}
};

// note that parse() returns a pointer to any *expression*!
std::unique_ptr<expression> parse() {
    // pass to make_unique whatever arguments the constructor of prefix_expression needs
    return std::make_unique<prefix_expression>(42);
}

int main() {
    {
        auto expr = parse();
        // here, *expr* goes out of score and properly deletes whatever it has new-ed
    }
}

EDIT:

To also answer the question in the title - no.

Fureeish
  • 9,869
  • 3
  • 23
  • 46
  • I appreciate the lengths you have gone to help me, but I am still interested in the standard's opinion. A quote of the standard is what I'm looking for. –  Jun 17 '19 at 23:28
  • 2
    "*A quote of the standard is what I'm looking for*" - quote regarding what exactly? – Fureeish Jun 17 '19 at 23:38
  • Returning a reference to data allocated in the heap during the scope of the function. I understand that returning a reference to a local object is bad because the reference quickly gets destroyed. What I am wondering is does using `new` change that and make it work almost like a normal variable to the outside world? What does the standard say? –  Jun 17 '19 at 23:43
  • 4
    @Chipster I believe if you do `int& foo() { return *new int{}; }; ... auto& x = foo(); delete &x;`, the code should behave as expected. But I would **strongly** advise against such ideas. First, you need to remember to capture the return type by reference. Then you'd need to remember to `delete &x`, which looks, well, *silly*. Lastly, you have no scope guards - if you don't `delete`, you end up with a memory leak. Smart pointer fixes all of that. – Fureeish Jun 17 '19 at 23:46
  • Awesome. Thanks. That's what I needed, I think. –  Jun 17 '19 at 23:48
  • also note that parse follows the "factory method pattern" – Kenny Ostrom Jun 18 '19 at 12:33
  • @KennyOstrom Please explain. I think I'm making a Pratt Parser. I'm not sure what the "factory method pattern" is. Maybe a relevant link so I could read up on it? –  Jun 18 '19 at 20:11
4

You're right - you want a pointer, and to get around scoping you'll need dynamic allocation.

Java's already doing that for you behind the scenes.

Don't use new, though, use a smart pointer, so that it doesn't get messy.

We can't provide "a quote from the standard" about this because we'd have to quote 20 or 30 pages of rules, from how automatic storage duration works, to how dereferencing works, to how lvalues work, to how copying works, to how inheritance works, to how virtual member functions work, etc, etc.

Lightness Races in Orbit
  • 358,771
  • 68
  • 593
  • 989
  • Okay. Then maybe a quote about returning a reference to memory allocated in the heap during the scope of that function? I hardly think a quote about how "virtual member functions work" or "how inheritance works", or even "how copying works" would be relevant here at all (except maybe because my example uses all of those things, but I since it uses all of those things, then I must know how that works...). My question is only about using `new` to return a reference to an object and avoid using pointer semantics. –  Jun 17 '19 at 23:47
  • @Chipster What makes you think there is a specific quote in the standard about "returning a reference to memory allocated in the heap during the scope of that function?" There's a chapter about functions, and a chapter about dynamic allocation, and a chapter about dereferencing, and a chapter about references... it doesn't have a quote for each conceivable use case. I don't really follow what you're looking for and why. Do you have a question/doubt about the proposed solution? If so, you should state it. – Lightness Races in Orbit Jun 17 '19 at 23:55
  • No question/doubt. I thought that because I'm an idiot :D I see that now. Thanks for your help. –  Jun 18 '19 at 00:09