
I have a pure abstract base and two derived classes:

#include <iostream>
#include <vector>
using std::cout; using std::endl;

struct B { virtual void foo() = 0; };
struct D1 : B { void foo() override { cout << "D1::foo()" << endl; } };
struct D2 : B { void foo() override { cout << "D2::foo()" << endl; } };

Does calling foo at Point A cost the same as a call to a non-virtual member function? Or is it more expensive than it would be if D1 and D2 did not derive from B?

int main() {
    D1 d1; D2 d2;
    std::vector<B*> v = { &d1, &d2 };

    d1.foo(); d2.foo();          // Point A (polymorphism not necessary)
    for (auto&& i : v) i->foo(); // Polymorphism necessary.

    return 0;
}

Answer: Andy Prowl's answer is essentially the right one; I just want to add the assembly output of gcc (tested on godbolt: gcc-4.7 -O2 -march=native -std=c++11). The cost of the direct function calls is:

mov     rdi, rsp
call    D1::foo()
mov     rdi, rbp
call    D2::foo()

And for the polymorphic calls:

mov     rdi, QWORD PTR [rbx]
mov     rax, QWORD PTR [rdi]
call    QWORD PTR [rax]
mov     rdi, QWORD PTR [rbx+8]
mov     rax, QWORD PTR [rdi]
call    QWORD PTR [rax]

However, if the objects don't derive from B and you just perform the direct calls, gcc will inline the function calls:

mov     esi, OFFSET FLAT:.LC0
mov     edi, OFFSET FLAT:std::cout
call    std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)

This could enable further optimizations if D1 and D2 did not derive from B, so I guess the answer is no: they are not equivalent (at least for this version of gcc with these optimizations; -O3 produced similar output, still without inlining). Is there something preventing the compiler from inlining in the case where D1 and D2 do derive from B?

"Fix": use delegates (aka reimplement virtual functions yourself):

#include <functional>

struct DG { // Delegate
    std::function<void()> foo;
    template<class C> DG(C& c) : foo([&c] { c.foo(); }) {} // stores a reference
                                                           // to c: the delegate
                                                           // must not outlive it
};

and then create a vector of delegates:

std::vector<DG> v = { d1, d2 };

This allows inlining when you access the methods in a non-polymorphic way. However, I guess accessing through the vector will be slower than (or at best as fast as, because std::function uses virtual functions internally for type erasure) just using virtual functions (I can't test with godbolt yet).

gnzlbg
    There is no reason the compiler couldn't inline the calls if `D1` and `D2` are derived from `B` for the direct calls. – Daniel Frey Feb 17 '13 at 16:18
  • You could not time the difference between those instruction sequences. – Martin York Feb 17 '13 at 17:51
  • Nothing prevents compiler from inlining `D1::foo()`, `D2::foo()`. It's some `GCC 4.7` and above glitch. `GCC 4.5` inlined this with no problems. `clang 3.4.1` inlined this as well. – doc Oct 17 '14 at 14:03
  • It still fails with gcc-4.9 (tip-of-trunk) -O3 -march=native -DNDEBUG (see code and assembly here: http://goo.gl/NKm3Uz). It should inline these calls since we have a single TU. In a more complex program, and unless you use `final`, it is very hard to inline these even with LTO, since you can always create a new TU in which you derive from a class (and a dynamic library could do so too). IIRC Herb Sutter described the issue as "with virtual inheritance you pay for infinite extensibility", and that has a cost. – gnzlbg Oct 17 '14 at 14:33
  • Furthermore, with virtual inheritance, the interface (or all possible interfaces, unless you use the Adapter pattern) gets put in the vtable with the object, and this vtable can become quite large. The delegate provides a smaller interface (and vtable), and this improves cache usage in loops. – gnzlbg Oct 17 '14 at 14:42
  • Addendum 2: clang 3.4.1 in godbolt inlines the calls, but the assembly size increases from ~184 lines to ~233. It is hard to know what this means, though. – gnzlbg Oct 17 '14 at 14:45
  • Does your "fix" really allow inlining? I think the opposite is true as any new value may be assigned to `foo` during runtime. Correct me if I'm wrong. – doc Oct 17 '14 at 15:50
  • First of all, don't confuse virtual inheritance with virtual functions. Virtual inheritance is a concept of its own. Inheritance itself does not cost anything, and the vtable for virtual functions is kept per class, not per object, so the cost in terms of memory is negligible. Your delegate struct is larger, because you store pointers in objects. – doc Oct 17 '14 at 16:03
  • @gnzlbg Re: clang. There's no difference at all. non-virtual foo -> http://goo.gl/RcXWDO vs virtual foo -> http://goo.gl/bH6V3W – doc Oct 17 '14 at 16:38
  • @doc the fix allows inlining (like a normal function) where you don't use a "pointer to base" kind of type erasure (the delegate). When you use the std::function, though, there is a virtual function inside of it. The same devirtualization techniques can be used for that, but in my experience (and I hope that changes) devirtualization almost always fails. – gnzlbg Oct 18 '14 at 08:17
  • @doc you are right, the vtable is kept per class, and the objects just have a pointer to their vtable. The same is true for the thing inside std::function, whose use has a larger memory overhead, and the vtable of the "type erasure" should have the same indirections. I don't know why it is faster than virtual functions, though, but other people have measured similar speedups to mine: http://probablydance.com/2013/01/13/a-faster-implementation-of-stdfunction/ – gnzlbg Oct 18 '14 at 08:22
  • @doc yes, in your case with clang there is no difference. But that case is not the same as the one I showed, which uses polymorphic access in one place, and direct access in the others. I chose that one to see what would happen in my code, where I need polymorphic access in some places, and direct access in others. Anyhow, the situation has improved a bit in 1.5 years, but it is IMO still far away from being good. – gnzlbg Oct 18 '14 at 08:26
  • @gnzlbg in your case there's just additional code for the polymorphic call, but it will not affect in any way the direct call you made later. A polymorphic call, under the hood, is really nothing else than calling a function through a pointer holding its address. You do almost the same thing explicitly with your delegate struct. Your method may be faster because you use one pointer per function, whereas indexed access into the vtable is required to call a virtual function. But this is so-called micro-optimization; it highly depends on the hardware architecture, and in some cases the vtable may be faster. – doc Oct 18 '14 at 15:41

2 Answers

8

Does calling foo in Point A cost the same as a call to a non-virtual member function?

Yes.

Or is it more expensive than if D1 and D2 wouldn't have derived from B?

No.

The compiler will resolve these function calls statically, because they are not performed through a pointer or through a reference. Since the type of the objects on which the function is called is known at compile-time, the compiler knows which implementation of foo() will have to be invoked.

Andy Prowl
  • See assembly code in the answer. It should, but in practice it isn't, because the compiler won't inline. – gnzlbg Feb 17 '13 at 16:06
  • 1
    @gnzlbg: Have you tried with heavier optimization, like -O3? I don't see what is preventing the compiler from inlining those calls. – Andy Prowl Feb 17 '13 at 16:20
  • Yes. I also tried -O3 but gcc didn't inline the calls. I don't think there is anything preventing the compiler from doing it, though. Godbolt's clang is not working for me right now; it might do it differently. – gnzlbg Feb 17 '13 at 16:35
  • I know it takes more instructions. But does it cost more? If you are talking about time, on a modern machine with a modern OS, then I doubt you could actually time them in such a way as to see a difference. There are so many things happening on a modern OS that microscopic differences like this will be completely swamped by random interrupts and other things the CPU is doing. – Martin York Feb 17 '13 at 17:49
  • 1
    @LokiAstari It can make a difference if the function you are calling does almost no work and is a/the hotspot of your application. Sometimes these functions are intended to be transparent and you rely on them being inlined. The fact that even though they are not used polymorphically they can't be inlined is IMO interesting at least. In a tight loop, whether the function is inlined can be the difference between the loop being unrolled or not. Still lots of can's, might's, if's... – gnzlbg Feb 17 '13 at 17:54
  • @gnzlbg: If you are using polymorphic types in a real application and you are at a point where dynamic dispatch is not being used and it is a hot spot, then yes. But in this case I would argue you have designed your application wrong. This is a micro-optimization that is rarely worth worrying about. Better design is a more pressing issue than micro-optimizations. – Martin York Feb 17 '13 at 18:01
  • 1
    @LokiAstari Really? I think that what's happening here is highly counterintuitive. Accepting the cost of a polymorphic type in a small non-critical part of my application doesn't mean I'm accepting the cost in places where I know that the compiler knows exactly which function of which type I'm calling. That's just wrong. I'd rather use delegates and pay a bit more in the non-critical part than use inheritance and pay the price everywhere. It's a non-trivial inheritance-vs-delegate-based-polymorphism decision that one has to make. – gnzlbg Feb 17 '13 at 18:12
  • @gnzlbg: Where you don't need to pay the price, the compiler has already optimized it out. So we are really only talking about polymorphic calls vs normal calls. The difference in cost of a dynamically dispatched call vs a normal call is insignificant enough that you cannot actually time the difference in a program running on a modern OS (if you were running on a chip with no interrupts, no OS and no memory caching, maybe you could time the difference accurately). As for your delegate solution (I am not sure how that fits in yet). – Martin York Feb 17 '13 at 19:28
  • @LokiAstari "Where you don't need to pay the price the compiler has already optimized it out." Kind of. It removes the virtual call but it doesn't inline the function. For some weird reason both cases aren't treated equally (at least in gcc 4.7.2 with -O3). Will it be faster/slower? Most of the time you won't be able to tell the difference. In some special situations, not inlining will be faster. In others, inlining could enable further optimizations. The intuition-breaking part, though, is that both cases should be treated equally, but aren't. – gnzlbg Feb 18 '13 at 15:37
  • @gnzlbg: If the compiler can inline, and its heuristics for deciding whether it should evaluate to true, it will; otherwise it will not. Fortunately the compiler is much smarter than humans at deciding when to inline and will only do so when it is a benefit. If the compiler is not inlining when you think it should, it is because you have not considered all the possibilities. **BUT** this is completely orthogonal to my original comment. I am **only** arguing that the cost of a dynamically dispatched call is insignificantly more than that of a normal call (nothing else). Thus the cost is not measurable (on a modern OS). – Martin York Feb 18 '13 at 15:40
  • 1
    @gnzlbg a delegate does practically the same thing as a vtable - it is an indirect call to a function denoted by a pointer. If you need such extreme performance tweaks in your application then you should use compile-time polymorphism. – doc Oct 17 '14 at 15:44
  • @doc I ended up doing that with tuples and boost.fusion in a couple of places. In the rest I used a class of std::function's as an interface. It is a single class that can adapt all my objects and provides default implementations. It has a slightly larger memory consumption, but is really fast and has been very easy to extend. I don't know if an Adapter class hierarchy would have been better, though; I haven't thought about it much. What I have thought about is having a vector of interfaces, where each std::function is in its own vector, but I haven't tried it yet (might not be worth the hassle). – gnzlbg Oct 18 '14 at 08:31
  • @gnzlbg you tend to try to avoid inheritance at all cost, while IMO there is no reason for this. You only make your software unnecessarily more complex. And do you know that "premature optimization is the root of all evil"? This is the original quote by Knuth, and "inheritance is the root of all evil" sounds like a (silly) paraphrase of this quote by some reactionaries. "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" -> http://en.wikipedia.org/wiki/Program_optimization – doc Oct 18 '14 at 15:22
  • When I tried it out I decided that I would maintain the mess if the speedup was > 1.5x. Removing inheritance in those two places gave my app a 2x speed-up in some regression tests so I decided to clean it up as much as possible and keep it. – gnzlbg Oct 18 '14 at 16:07
  • @gnzlbg is this real app or just simple tests like those above? If your functions don't do anything significant then it's obvious that time is spent on just function calls. – doc Oct 18 '14 at 16:21
  • @gnzlbg It will give you less and less speed-up as your functions get more complex. And what will dominate is algorithmic complexity, not micro-optimizations. This is visible if you compare sorting algorithms -> http://www.sorting-algorithms.com/ Even if a single operation were 100000x faster in selection sort than in quick sort, quick sort would finish first for a large enough problem. – doc Oct 18 '14 at 16:33
  • @doc real app, and the functions do work (they are non-trivial). The problem is not the indirection cost per se, the problem is that the compiler cannot see past the indirection, and this prevented a trivial vectorization which gave a 2x speed-up. I wanted both generic and fast code, and virtual functions are generic, but they are not a zero-cost abstraction. In some domains and most parts of your applications you don't care. But in the parts that you do, they force you to still pay the price. – gnzlbg Oct 19 '14 at 09:51
  • And I mean, the tests I wrote above are trivial, and 1.5 years ago compilers couldn't inline them. Nowadays even modern compilers still have trouble _on trivial cases_. Most of the time you don't care, but today if someone on code review argues about a design decision with "it will get devirtualized" I would immediately suggest the assembly to be checked (it takes a minute to do so). Most of the time you don't care (and thus not argue about it). But compilers are still pretty bad at this, and in my field a 2x speed up is a lot of money. – gnzlbg Oct 19 '14 at 09:59
  • @gnzlbg virtual functions are as old as C++ - it's not something that was born 1.5 years ago. Of course compilers were able to inline them - test older versions of gcc and it inlines them. Relying on disassembly from particular compiler is IMO really bad idea, but after all it's not my business and I think it's time to end this discussion. – doc Oct 19 '14 at 13:53
4

The simplest approach is to look at the compiler's innards. In Clang we find canDevirtualizeMemberFunctionCall in lib/CodeGen/CGClass.cpp:

/// canDevirtualizeMemberFunctionCall - Checks whether the given virtual member
/// function call on the given expr can be devirtualized.
static bool canDevirtualizeMemberFunctionCall(const Expr *Base, 
                                              const CXXMethodDecl *MD) {
  // If the most derived class is marked final, we know that no subclass can
  // override this member function and so we can devirtualize it. For example:
  //
  // struct A { virtual void f(); }
  // struct B final : A { };
  //
  // void f(B *b) {
  //   b->f();
  // }
  //
  const CXXRecordDecl *MostDerivedClassDecl = getMostDerivedClassDecl(Base);
  if (MostDerivedClassDecl->hasAttr<FinalAttr>())
    return true;

  // If the member function is marked 'final', we know that it can't be
  // overridden and can therefore devirtualize it.
  if (MD->hasAttr<FinalAttr>())
    return true;

  // Similarly, if the class itself is marked 'final' it can't be overridden
  // and we can therefore devirtualize the member function call.
  if (MD->getParent()->hasAttr<FinalAttr>())
    return true;

  Base = skipNoOpCastsAndParens(Base);
  if (const DeclRefExpr *DRE = dyn_cast<DeclRefExpr>(Base)) {
    if (const VarDecl *VD = dyn_cast<VarDecl>(DRE->getDecl())) {
      // This is a record decl. We know the type and can devirtualize it.
      return VD->getType()->isRecordType();
    }

    return false;
  }

  // We can always devirtualize calls on temporary object expressions.
  if (isa<CXXConstructExpr>(Base))
    return true;

  // And calls on bound temporaries.
  if (isa<CXXBindTemporaryExpr>(Base))
    return true;

  // Check if this is a call expr that returns a record type.
  if (const CallExpr *CE = dyn_cast<CallExpr>(Base))
    return CE->getCallReturnType()->isRecordType();

  // We can't devirtualize the call.
  return false;
}

I believe the code (and accompanying comments) are self-explanatory :)

Matthieu M.
  • So if the member function / class is effectively final but not marked final, the call won't be devirtualized? (And by marked I mean either by me or by the compiler itself.) – gnzlbg Feb 17 '13 at 17:42
  • @gnzlbg: it's a bit more complicated than that. Basically, the goal here is to determine the *final overrider*. If it's marked final, then you know it is; if it is not, then you may still be able to devirtualize the call providing you are able to determine the dynamic type of the object *statically*, such as in `int main() { Derived d; Base& b = d; b.foo(); }` where `b` is "obviously" a reference to an instance of `Derived`. – Matthieu M. Feb 17 '13 at 18:30