software rasterization implementation speed up ideas

Question

I am writing software rasterization of triangles. In the example bellow I am rasterizing 1 full screen triangle and its still damned slow...on my i7 it runs about 130 FPS for 1 polygon at 1024x768.

I can see some space for optimalization but still not even close to performance I was hopping to get. What am I missing, any ideas how to speed this up rapidly? ...even Quake2 was running faster in software rendering on Pentium 2 with more polygons I guess :/

Thanks for any suggestions!

Here is the code (I am using gmtl for Vectors)

float cross(Vec2f const &edge1, Vec2f const &edge2)
{
    return edge1[0] * edge2[1] - edge1[1] * edge2[0];
}

bool caculateBarycentricWeights(float invDoubleTriArea, float doubleTriArea, Vec2f const (&edgesFromTo01_12_20)[3], Vec2f const (&vertex)[3], Vec2f const &point, float(&outWeight)[3])
{
    outWeight[0] = cross(point - vertex[2], edgesFromTo01_12_20[1]);
    outWeight[1] = cross(point - vertex[0], edgesFromTo01_12_20[2]);
    outWeight[2] = cross(point - vertex[1], edgesFromTo01_12_20[0]);

    if (outWeight[0] >= 0.0f &&  outWeight[1] >= 0.0f &&  outWeight[2] >= 0.0f
        && outWeight[0] + outWeight[1] + outWeight[2] <= doubleTriArea * 1.000001f)
    {
        outWeight[0] *= invDoubleTriArea;
        outWeight[1] *= invDoubleTriArea;
        outWeight[2] *= invDoubleTriArea;

        return true;
    }
    else
    {
        return false;
    }
}

void rasterize(Vec3f const &vertex0, Vec3f const &vertex1, Vec3f const &vertex2, vector<Vec3f> &dstBuffer)
{
    Vec2f const edgesFromTo01_12_20[3] =
    {
        Vec2f(vertex1[0] - vertex0[0], vertex1[1] - vertex0[1]),
        Vec2f(vertex2[0] - vertex1[0], vertex2[1] - vertex1[1]),
        Vec2f(vertex0[0] - vertex2[0], vertex0[1] - vertex2[1])
    };

    //in viewport space up and down is switched, thats why '-'
    float const doubleTriangleArea = -cross(edgesFromTo01_12_20[0], edgesFromTo01_12_20[1]);
    float const invDoubleTriangleArea = 1.0f / doubleTriangleArea;


    float weight[3];

    float invZ[3] = { 1.0f / vertex0[2], 1.0f / vertex1[2], 1.0f / vertex2[2] };

    Vec2f p(vertex0[0], vertex0[1]);

    Vec2f vPos[3] = {
        Vec2f(vertex0[0], vertex0[1]),
        Vec2f(vertex1[0], vertex1[1]),
        Vec2f(vertex2[0], vertex2[1])
    };

    const size_t RES_X = 1024;
    const size_t RES_Y = 768;

    for (size_t y = 0; y < RES_Y; ++y)
    {
        p[1] = y;

        for (size_t x = 0; x < RES_X; ++x)
        {
            p[0] = x;

            if (caculateBarycentricWeights(invDoubleTriangleArea, doubleTriangleArea, edgesFromTo01_12_20, vPos, p, weight))
            {
                //interpolate values
                float z = 1.0f / (weight[0] * invZ[0] + weight[1] * invZ[1] + weight[2] * invZ[2]);
                dstBuffer[y * RES_X + x] = ((vertex0 * (weight[0] * invZ[0])) + (vertex1 * (weight[1] * invZ[1])) + (vertex2 * (weight[2] * invZ[2]))) * z;

            }
        }
    }
}


int main(int argc, char *argv[])
{   
    vector<Vec3f> buffer;
    buffer.resize(1024 * 768);

    high_resolution_clock::time_point start = high_resolution_clock::now();
    size_t fps = 0;

    while (true) 
    {

        static Vec3f v0 = { -2000.0f, -2000.0, 10.0 };
        static Vec3f v1 = { 2000.0f, -2000.0, 10.0 };
        static Vec3f v2 = { 0.0f, 2000.0, 10.0 };

        rasterize(v0, v2, v1, buffer);
        ++fps;

        high_resolution_clock::time_point now = high_resolution_clock::now();
        auto timeDiff = duration_cast<microseconds>(now - start).count();

        static const long oneSecond = 1000000;
        if (timeDiff >= oneSecond)
        {

            cout << fps << " ";

            fps = 0;
            start = now;
        }

    }


    return 0;
}

This is what I get when using CPU sampling profiling

enter image description here

Not an issue for a fullscreen geometry, but your for-loops (x/y) should only iterate over the bounding box of the triangle, not over the whole screen. — BDL, Sep 26 '18 at 14:31
Yes I do that, this is just simplified code...but still its just 1 fullscreen triangle at 120FPS? — Martin, Sep 26 '18 at 14:33
Requests for feedback on working code should be made at https://codereview.stackexchange.com/. — François Andrieux, Sep 26 '18 at 14:34
Using a std::vector with 3 floats per pixel might also be an issue. Try a c++ array (new/delete) and use 1 byte per channel. This should improve memory coherency and reduces the memory by a factor of 4 (don't know if integer instead of floating point calculation will make a difference). — BDL, Sep 26 '18 at 14:37
@FrançoisAndrieux: Yes, they are better suited at codereview, but they are also on-topic here. — BDL, Sep 26 '18 at 14:38
Writing into array of shorts doesnt improve the performance (just tested) Removing floating point calculations would probably help, but I will need them later ;) — Martin, Sep 26 '18 at 14:54
Could you remove all calculations from rasterize and just fill the buffer with a constant value? This would establish a baseline for performance (you definitely won't get faster than that). — BDL, Sep 26 '18 at 15:08
You should also check out Mesa's swrast module to see what kind of performance you're getting with that. If it's not much different it means you're not doing anything wrong. Maybe the quake software renderer had assembly code? https://dri.freedesktop.org/docs/mesa/swrast/files.html — Frederik De Ruyck, Sep 26 '18 at 15:45
It would help if one could look at the assembly for this. Any chance you could turn this into a compilable example? It's mainly missing the type definitions. — Max Langhof, Sep 26 '18 at 15:47
Also, have you tried relaxed floating point math? It can yield some substantial speedups (and is generally acceptable for rendering calculations). — Max Langhof, Sep 26 '18 at 15:50
Why not rasterize only 3 tips of that triangle, then draw 3 lines between them on the resulting image itself, then fill pixels in interior part of those 3 intersecting lines? You can deduce a function that does the conversion of triangle x,y,z to screen x,y,z. Probably using inverse of the calculations you do, or by training a neural network to do it for you. Since moving a triangle and its rasterized x,y function could be a differentiable function. — huseyin tugrul buyukisik, Sep 26 '18 at 19:43
This one is low hanging fruit -- make sure both `cross` and `caculateBarycentricWeights` get inlined. (Forcing inlining doubled the throughput in my tests). — Dan Mašek, Sep 26 '18 at 19:48
Is the test for `<= doubleTriArea * 1.000001f` necessary? It never seems to be the case in this scenario. You can also premultiply `invZ` by `invDoubleTriArea`, and drop that from `caculateBarycentricWeights`. That runs at ~170 FPS on a 3.4Ghz Sandy Bridge i7 -- which is on average 25 clock cycles per pixel. Given how much floating point math there is per pixel and that we're rendering on most of the screen... that's not so bad. Switching to fixed point representation and using integer arithmetics would most likely allow more. The more you can do with integers, the better. — Dan Mašek, Sep 26 '18 at 22:50

Scheff's Cat · Answer 1 · 2018-10-03T15:58:47.147

I believe I must be able to beat 130 FPS.

...even Quake2 was running faster

I must admit that I got my inspiration from OpenGL – resembling a few parts of it (at least, how I believe it could work) and with some simplifications.

I must admit also that I didn't understand what exactly the caculateBarycentricWeights() function is good for.

However, I based my implementation on the following concept:

rasterizer must work in screen space
3D vertices have to be converted to screen space before rasterizing
The inner rastering loop has to be as simple as possible – branches (i.e. if and a complex function) in the most inner loop are counter-productive for sure.

Hence, I made a new rasterize() function. Thereby, I considered that rasterize() has to fill horizontal screen lines. For this, I sort the 3 triangle vertices by their y components. (Note, that vertices have to be transformed to screen space beforehand.) In regular case, this allows to cut the triangle horizontally into two parts:

the upper with peak above of horizontal base
the lower with peak below of horizontal base.

The source code follows below but the sketch may help to make all the interpolations a little bit more understandable:

To convert the vtcs beforehand to screen space, I used two matrices:

const Mat4x4f matProj(InitScale, 1.0f / Width, 1.0f / Height, 1.0f);
const Mat4x4f matScreen
  = Mat4x4f(InitScale, 0.5f * Width, 0.5f * Height, 1.0f)
  * Mat4x4f(InitTrans, Vec3f(1.0f, 1.0f, 0.0f));
const Mat4x4f mat = matScreen * matProj /* * matView * matModel */;

where

matProj is the projection matrix to scale model coordinates to clip space
matScreen is responsible to convert clip space to screen space.

Clip space is a space where visible part is in range [-1, 1] for x, y, and z direction.

Screen space has ranges [0, width) and [0, height) where I used width = 1024 and height = 768 according to OPs requirement.

When I create OpenGL programs it's (some kind of) annoying tradition that I everytimes start with a blue screen (as blue is my favorite clear color). This is mostly caused by confusing some of the matrix operations and followed by the effort to find and fix these operations. Therefore, I tried to visualize the rasterization result to become sure that my benchmarking is not overly enthusiastic due to the fact that simply any part of triangle is outside of view and hence clipped away:

I would count this a visual proof. To achieve this, I wrapped the rasterizer sample into a simple Qt application where the FBO data is stored in a QImage which in turn is applied to a QPixmap displayed with a QLabel. Once I got this, I thought it would be nice to add some kind of simple animation by modifying the model-view matrix over time. So, I got the following sample:

#include <cstdint>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

#include "linmath.h"

typedef unsigned uint;
typedef std::uint32_t uint32;

const int Width = 1024, Height = 768;

class FBO {
  public:
    const int width, height;
  private:
    std::vector<uint32> _rgba; // pixel buffer
  public:
    FBO(int width, int height):
      width(width), height(height),
      _rgba(width * height, 0)
    { }
    ~FBO() = default;
    FBO(const FBO&) = delete;
    FBO& operator=(const FBO&) = delete;

    void clear(uint32 rgba) { std::fill(_rgba.begin(), _rgba.end(), rgba); }
    void set(int x, int y, uint32 rgba)
    {
      const size_t i = y * width + x;
      _rgba[i] = rgba;
    }
    size_t getI(int y) const { return y * width; }
    size_t getI(int x, int y) const { return y * width + x; }
    void set(size_t i, uint32 rgba) { _rgba[i] = rgba; }

    const std::vector<uint32>& getRGBA() const { return _rgba; }
};

void rasterize(FBO &fbo, const Vec3f vtcs[3], const uint32 rgba)
{
  // sort vertices by y coordinates
  uint iVtcs[3] = { 0, 1, 2 };
  if (vtcs[iVtcs[0]].y > vtcs[iVtcs[1]].y) std::swap(iVtcs[0], iVtcs[1]);
  if (vtcs[iVtcs[1]].y > vtcs[iVtcs[2]].y) std::swap(iVtcs[1], iVtcs[2]);
  if (vtcs[iVtcs[0]].y > vtcs[iVtcs[1]].y) std::swap(iVtcs[0], iVtcs[1]);
  const Vec3f vtcsS[3] = { vtcs[iVtcs[0]], vtcs[iVtcs[1]], vtcs[iVtcs[2]] };
  const float yM = vtcs[1].y;
  // cut triangle in upper and lower part
  const float xT = vtcsS[0].x;
  float xML = vtcsS[1].x;
  const float f = (yM - vtcsS[0].y) / (vtcsS[2].y - vtcsS[0].y);
  float xMR = (1.0f - f) * xT + f * vtcsS[2].x;
  if (xML > xMR) std::swap(xML, xMR);
  // draw upper part of triangle
  if (vtcs[iVtcs[0]].y < yM) {
    const float dY = yM - vtcsS[0].y;
    for (int y = std::max((int)vtcsS[0].y, 0),
      yE = std::min((int)(yM + 0.5f), fbo.height);
      y < yE; ++y) {
      const float f1 = (yM - y) / dY, f0 = 1.0f - f1;
      const float xL = f0 * xT + f1 * xML, xR = f0 * xT + f1 * xMR;
      const size_t i = fbo.getI(y);
      for (int x = std::max((int)xL, 0),
        xE = std::min((int)(xR + 0.5f), fbo.width);
        x < xE; ++x) {
        fbo.set(i + x, rgba);
      }
    }
  } // else upper edge horizontal
  // draw lower part of triangle
  if (yM < vtcs[2].y) {
    const float xB = vtcsS[2].x;
    const float dY = vtcsS[2].y - yM;
    for (int y = std::max((int)yM, 0),
      yE = std::min((int)(vtcsS[2].y + 0.5f), fbo.height);
      y < yE; ++y) {
      const float f1 = (y - yM) / dY, f0 = 1.0f - f1;
      const float xL = f0 * xML + f1 * xB, xR = f0 * xMR + f1 * xB;
      const size_t i = fbo.getI(y);
      for (int x = std::max((int)xL, 0),
        xE = std::min((int)(xR + 0.5f), fbo.width);
        x < xE; ++x) {
        fbo.set(i + x, rgba);
      }
    }
  } // else lower edge horizontal
}

template <typename VALUE>
Vec3T<VALUE> transformPoint(const Mat4x4T<VALUE> &mat, const Vec3T<VALUE> &pt)
{
  Vec4T<VALUE> pt_ = mat * Vec4T<VALUE>(pt.x, pt.y, pt.z, (VALUE)1);
  return pt_.w != (VALUE)0
    ? Vec3T<VALUE>(pt_.x / pt_.w, pt_.y / pt_.w, pt_.z / pt_.w)
    : Vec3T<VALUE>(pt_.x, pt_.y, pt_.z);
}

typedef std::chrono::high_resolution_clock HiResClock;
typedef std::chrono::microseconds MicroSecs;

int mainBench()
{
  const Mat4x4f matProj(InitScale, 1.0f / Width, 1.0f / Height, 1.0f);
  const Mat4x4f matScreen
    = Mat4x4f(InitScale, 0.5f * Width, 0.5f * Height, 1.0f)
    * Mat4x4f(InitTrans, Vec3f(1.0f, 1.0f, 0.0f));
  const Mat4x4f mat = matScreen * matProj /* * matView * matModel */;
  FBO fbo(Width, Height);

  HiResClock::time_point start = HiResClock::now();
  size_t fps = 0;

  for (;;) {
    static Vec3f v0 = { -2000.0f, -2000.0, 10.0 };
    static Vec3f v1 = { 2000.0f, -2000.0, 10.0 };
    static Vec3f v2 = { 0.0f, 2000.0, 10.0 };
    const Vec3f vtcs[] = {
      transformPoint(mat, v0), transformPoint(mat, v1), transformPoint(mat, v2)
    };
    rasterize(fbo, vtcs, 0xff0000ff);
    ++fps;

    HiResClock::time_point now = HiResClock::now();
    auto timeDiff
      = std::chrono::duration_cast<MicroSecs>(now - start).count();

    static const long oneSecond = 1000000;
    if (timeDiff >= oneSecond) {
      std::cout << fps << " " << std::flush;
      fps = 0;
      start = now;
    }
  }
}

#include <stack>

#include <QtWidgets>

struct RenderContext {
  FBO fbo; // frame buffer object
  Mat4x4f matModelView;
  Mat4x4f matProj;

  RenderContext(int width, int height):
    fbo(width, height),
    matModelView(InitIdent),
    matProj(InitIdent)
  { }
  ~RenderContext() = default;
};

void render(RenderContext &context)
{
  context.fbo.clear(0xffffcc88);
  static Vec3f v0 = { -2000.0f, -2000.0, 10.0 };
  static Vec3f v1 = { 2000.0f, -2000.0, 10.0 };
  static Vec3f v2 = { 0.0f, 2000.0, 10.0 };
#if 0 // test transformations of mainBench()
  const Mat4x4f matProj(InitScale, 1.0f / Width, 1.0f / Height, 1.0f);
  const Mat4x4f matScreen
    = Mat4x4f(InitScale, 0.5f * Width, 0.5f * Height, 1.0f)
    * Mat4x4f(InitTrans, Vec3f(1.0f, 1.0f, 0.0f));
  const Mat4x4f mat = matScreen * matProj /* * matView * matModel */;
#else // make a simple animation
  const Mat4x4f matScreen
    = Mat4x4f(InitScale, 0.5f * context.fbo.width, 0.5f * context.fbo.height, 1.0f)
    * Mat4x4f(InitTrans, Vec3f(1.0f, 1.0f, 0.0f));
  const Mat4x4f mat = matScreen * context.matProj * context.matModelView;
#endif // 0
  const Vec3f vtcs[] = {
    transformPoint(mat, v0), transformPoint(mat, v1), transformPoint(mat, v2)
  };
  rasterize(context.fbo, vtcs, 0xff0000ff);
}

int main(int argc, char **argv)
{
  // evaluate arguments
  const bool gui = argc > 1
    && (strcmp(argv[1], "gui") == 0 || strcmp(argv[1], "-gui") == 0);
  // without arguments: start benchmark
  if (!gui) return mainBench();
  // remove argument "gui"
  for (char **arg = argv + 1; *arg; ++arg) arg [0] = arg[1];
  // Qt Demo
  qDebug() << "Qt Version:" << QT_VERSION_STR;
  QApplication app(argc, argv);
  RenderContext context(Width, Height);
  // setup GUI
  QPixmap qPixmapImg(Width, Height);
  QLabel qLblImg;
  qLblImg.setWindowTitle("Software Rasterizer Demo");
  qLblImg.setPixmap(qPixmapImg);
  qLblImg.show();
  // Qt timer for periodic rendering/animation
  QTime qTime(0, 0);
  QTimer qTimer;
  qTimer.setInterval(50);
  // install signal handlers
  QObject::connect(&qTimer, &QTimer::timeout,
    [&]() {
      // simple animation
      const float r = 1.0f, t = 0.001f * qTime.elapsed();
      context.matModelView
        = Mat4x4f(InitTrans, Vec3f(r * sin(t), r * cos(t), 0.0f))
        * Mat4x4f(InitScale, 0.0001f, 0.0001f, 0.0001f);
      // rendering
      render(context);
      // transfer FBO to qLblImg
      const QImage qImg((uchar*)context.fbo.getRGBA().data(),
        Width, Height, QImage::Format_RGBA8888);
      qPixmapImg.convertFromImage(qImg);
      qLblImg.setPixmap(qPixmapImg);
    });
  // runtime loop
  qTime.start(); qTimer.start();
  return app.exec();
}

Instead of gmtl (which I don't know), I used linmath.h (left from earlier MCVEs I wrote in the past):

#ifndef LIN_MATH_H
#define LIN_MATH_H

#include <iostream>
#include <cassert>
#include <cmath>

double Pi = 3.1415926535897932384626433832795;

template <typename VALUE>
inline VALUE degToRad(VALUE angle)
{
  return (VALUE)Pi * angle / (VALUE)180;
}

template <typename VALUE>
inline VALUE radToDeg(VALUE angle)
{
  return (VALUE)180 * angle / (VALUE)Pi;
}

template <typename VALUE>
struct Vec2T {
  VALUE x, y;
  Vec2T() { }
  Vec2T(VALUE x, VALUE y): x(x), y(y) { }
};
template <typename VALUE>
VALUE length(const Vec2T<VALUE> &vec)
{
  return sqrt(vec.x * vec.x + vec.y * vec.y);
}
template <typename VALUE>
std::ostream& operator<<(std::ostream &out, const Vec2T<VALUE> &v)
{
  return out << "( " << v.x << ", " << v.y << " )";
}

typedef Vec2T<float> Vec2f;
typedef Vec2T<double> Vec2;

template <typename VALUE>
struct Vec3T {
  VALUE x, y, z;
  Vec3T() { }
  Vec3T(VALUE x, VALUE y, VALUE z): x(x), y(y), z(z) { }
  Vec3T(const Vec2T<VALUE> &xy, VALUE z): x(xy.x), y(xy.y), z(z) { }
  explicit operator Vec2T<VALUE>() const { return Vec2T<VALUE>(x, y); }
};
template <typename VALUE>
std::ostream& operator<<(std::ostream &out, const Vec3T<VALUE> &v)
{
  return out << "( " << v.x << ", " << v.y << ", " << v.z << " )";
}

typedef Vec3T<float> Vec3f;
typedef Vec3T<double> Vec3;

template <typename VALUE>
struct Vec4T {
  VALUE x, y, z, w;
  Vec4T() { }
  Vec4T(VALUE x, VALUE y, VALUE z, VALUE w): x(x), y(y), z(z), w(w) { }
  Vec4T(const Vec2T<VALUE> &xy, VALUE z, VALUE w):
    x(xy.x), y(xy.y), z(z), w(w)
  { }
  Vec4T(const Vec3T<VALUE> &xyz, VALUE w):
    x(xyz.x), y(xyz.y), z(xyz.z), w(w)
  { }
  explicit operator Vec2T<VALUE>() const { return Vec2T<VALUE>(x, y); }
  explicit operator Vec3T<VALUE>() const { return Vec3T<VALUE>(x, y, z); }
};
template <typename VALUE>
std::ostream& operator<<(std::ostream &out, const Vec4T<VALUE> &v)
{
  return out << "( " << v.x << ", " << v.y << ", " << v.z << ", " << v.w << " )";
}

typedef Vec4T<float> Vec4f;
typedef Vec4T<double> Vec4;

enum ArgInitIdent { InitIdent };
enum ArgInitTrans { InitTrans };
enum ArgInitRot { InitRot };
enum ArgInitRotX { InitRotX };
enum ArgInitRotY { InitRotY };
enum ArgInitRotZ { InitRotZ };
enum ArgInitScale { InitScale };

template <typename VALUE>
struct Mat4x4T {
  union {
    VALUE comp[4 * 4];
    struct {
      VALUE _00, _01, _02, _03;
      VALUE _10, _11, _12, _13;
      VALUE _20, _21, _22, _23;
      VALUE _30, _31, _32, _33;
    };
  };

  // constructor to build a matrix by elements
  Mat4x4T(
    VALUE _00, VALUE _01, VALUE _02, VALUE _03,
    VALUE _10, VALUE _11, VALUE _12, VALUE _13,
    VALUE _20, VALUE _21, VALUE _22, VALUE _23,
    VALUE _30, VALUE _31, VALUE _32, VALUE _33):
    _00(_00), _01(_01), _02(_02), _03(_03),
    _10(_10), _11(_11), _12(_12), _13(_13),
    _20(_20), _21(_21), _22(_22), _23(_23),
    _30(_30), _31(_31), _32(_32), _33(_33)
  { }
  // constructor to build an identity matrix
  Mat4x4T(ArgInitIdent):
    _00((VALUE)1), _01((VALUE)0), _02((VALUE)0), _03((VALUE)0),
    _10((VALUE)0), _11((VALUE)1), _12((VALUE)0), _13((VALUE)0),
    _20((VALUE)0), _21((VALUE)0), _22((VALUE)1), _23((VALUE)0),
    _30((VALUE)0), _31((VALUE)0), _32((VALUE)0), _33((VALUE)1)
  { }
  // constructor to build a matrix for translation
  Mat4x4T(ArgInitTrans, const Vec3T<VALUE> &t):
    _00((VALUE)1), _01((VALUE)0), _02((VALUE)0), _03((VALUE)t.x),
    _10((VALUE)0), _11((VALUE)1), _12((VALUE)0), _13((VALUE)t.y),
    _20((VALUE)0), _21((VALUE)0), _22((VALUE)1), _23((VALUE)t.z),
    _30((VALUE)0), _31((VALUE)0), _32((VALUE)0), _33((VALUE)1)
  { }
  // constructor to build a matrix for rotation about axis
  Mat4x4T(ArgInitRot, const Vec3T<VALUE> &axis, VALUE angle):
    _03((VALUE)0), _13((VALUE)0), _23((VALUE)0),
    _30((VALUE)0), _31((VALUE)0), _32((VALUE)0), _33((VALUE)1)
  {
    //axis.normalize();
    const VALUE sinAngle = sin(angle), cosAngle = cos(angle);
    const VALUE xx = axis.x * axis.x, xy = axis.x * axis.y;
    const VALUE xz = axis.x * axis.z, yy = axis.y * axis.y;
    const VALUE yz = axis.y * axis.z, zz = axis.z * axis.z;
    _00 = xx + cosAngle * ((VALUE)1 - xx) /* + sinAngle * 0 */;
    _01 = xy - cosAngle * xy - sinAngle * axis.z;
    _02 = xz - cosAngle * xz + sinAngle * axis.y;
    _10 = xy - cosAngle * xy + sinAngle * axis.z;
    _11 = yy + cosAngle * ((VALUE)1 - yy) /* + sinAngle * 0 */;
    _12 = yz - cosAngle * yz - sinAngle * axis.x;
    _20 = xz - cosAngle * xz - sinAngle * axis.y;
    _21 = yz - cosAngle * yz + sinAngle * axis.x;
    _22 = zz + cosAngle * ((VALUE)1 - zz) /* + sinAngle * 0 */;
  }
  // constructor to build a matrix for rotation about x axis
  Mat4x4T(ArgInitRotX, VALUE angle):
    _00((VALUE)1), _01((VALUE)0),    _02((VALUE)0),   _03((VALUE)0),
    _10((VALUE)0), _11(cos(angle)),  _12(-sin(angle)), _13((VALUE)0),
    _20((VALUE)0), _21(sin(angle)), _22(cos(angle)), _23((VALUE)0),
    _30((VALUE)0), _31((VALUE)0),    _32((VALUE)0),   _33((VALUE)1)
  { }
  // constructor to build a matrix for rotation about y axis
  Mat4x4T(ArgInitRotY, VALUE angle):
    _00(cos(angle)), _01((VALUE)0), _02(sin(angle)), _03((VALUE)0),
    _10((VALUE)0),   _11((VALUE)1), _12((VALUE)0),    _13((VALUE)0),
    _20(-sin(angle)), _21((VALUE)0), _22(cos(angle)),  _23((VALUE)0),
    _30((VALUE)0), _31((VALUE)0),    _32((VALUE)0),   _33((VALUE)1)
  { }
  // constructor to build a matrix for rotation about z axis
  Mat4x4T(ArgInitRotZ, VALUE angle):
    _00(cos(angle)),  _01(-sin(angle)), _02((VALUE)0), _03((VALUE)0),
    _10(sin(angle)), _11(cos(angle)), _12((VALUE)0), _13((VALUE)0),
    _20((VALUE)0),    _21((VALUE)0),   _22((VALUE)1), _23((VALUE)0),
    _30((VALUE)0),    _31((VALUE)0),   _32((VALUE)0), _33((VALUE)1)
  { }
  // constructor to build a matrix for scaling
  Mat4x4T(ArgInitScale, VALUE sx, VALUE sy, VALUE sz):
    _00((VALUE)sx), _01((VALUE)0),  _02((VALUE)0),  _03((VALUE)0),
    _10((VALUE)0),  _11((VALUE)sy), _12((VALUE)0),  _13((VALUE)0),
    _20((VALUE)0),  _21((VALUE)0),  _22((VALUE)sz), _23((VALUE)0),
    _30((VALUE)0),  _31((VALUE)0),  _32((VALUE)0),  _33((VALUE)1)
  { }
  // operator to allow access with [][]
  double* operator [] (int i)
  {
    assert(i >= 0 && i < 4);
    return comp + 4 * i;
  }
  // operator to allow access with [][]
  const double* operator [] (int i) const
  {
    assert(i >= 0 && i < 4);
    return comp + 4 * i;
  }
  // multiply matrix with matrix -> matrix
  Mat4x4T operator * (const Mat4x4T &mat) const
  {
    return Mat4x4T(
      _00 * mat._00 + _01 * mat._10 + _02 * mat._20 + _03 * mat._30,
      _00 * mat._01 + _01 * mat._11 + _02 * mat._21 + _03 * mat._31,
      _00 * mat._02 + _01 * mat._12 + _02 * mat._22 + _03 * mat._32,
      _00 * mat._03 + _01 * mat._13 + _02 * mat._23 + _03 * mat._33,
      _10 * mat._00 + _11 * mat._10 + _12 * mat._20 + _13 * mat._30,
      _10 * mat._01 + _11 * mat._11 + _12 * mat._21 + _13 * mat._31,
      _10 * mat._02 + _11 * mat._12 + _12 * mat._22 + _13 * mat._32,
      _10 * mat._03 + _11 * mat._13 + _12 * mat._23 + _13 * mat._33,
      _20 * mat._00 + _21 * mat._10 + _22 * mat._20 + _23 * mat._30,
      _20 * mat._01 + _21 * mat._11 + _22 * mat._21 + _23 * mat._31,
      _20 * mat._02 + _21 * mat._12 + _22 * mat._22 + _23 * mat._32,
      _20 * mat._03 + _21 * mat._13 + _22 * mat._23 + _23 * mat._33,
      _30 * mat._00 + _31 * mat._10 + _32 * mat._20 + _33 * mat._30,
      _30 * mat._01 + _31 * mat._11 + _32 * mat._21 + _33 * mat._31,
      _30 * mat._02 + _31 * mat._12 + _32 * mat._22 + _33 * mat._32,
      _30 * mat._03 + _31 * mat._13 + _32 * mat._23 + _33 * mat._33);
  }
  // multiply matrix with vector -> vector
  Vec4T<VALUE> operator * (const Vec4T<VALUE> &vec) const
  {
    return Vec4T<VALUE>(
      _00 * vec.x + _01 * vec.y + _02 * vec.z + _03 * vec.w,
      _10 * vec.x + _11 * vec.y + _12 * vec.z + _13 * vec.w,
      _20 * vec.x + _21 * vec.y + _22 * vec.z + _23 * vec.w,
      _30 * vec.x + _31 * vec.y + _32 * vec.z + _33 * vec.w);
  }
};

template <typename VALUE>
std::ostream& operator<<(std::ostream &out, const Mat4x4T<VALUE> &m)
{
  return out
    << m._00 << '\t' << m._01 << '\t' << m._02 << '\t' << m._03 << '\n'
    << m._10 << '\t' << m._11 << '\t' << m._12 << '\t' << m._13 << '\n'
    << m._20 << '\t' << m._21 << '\t' << m._22 << '\t' << m._23 << '\n'
    << m._30 << '\t' << m._31 << '\t' << m._32 << '\t' << m._33 << '\n';
}
typedef Mat4x4T<float> Mat4x4f;
typedef Mat4x4T<double> Mat4x4;

// enumeration of rotation axes
enum RotAxis {
  RotX, // rotation about x axis
  RotY, // rotation about y axis
  RotZ // rotation about z axis
};

// enumeration of possible Euler angles
enum EulerAngle {
  RotXYX = RotX + 3 * RotY + 9 * RotX, // 0 + 3 + 0 = 3
  RotXYZ = RotX + 3 * RotY + 9 * RotZ, // 0 + 3 + 18 = 21
  RotXZX = RotX + 3 * RotZ + 9 * RotX, // 0 + 6 + 0 = 6
  RotXZY = RotX + 3 * RotZ + 9 * RotY, // 0 + 6 + 9 = 15
  RotYXY = RotY + 3 * RotX + 9 * RotY, // 1 + 0 + 9 = 10
  RotYXZ = RotY + 3 * RotX + 9 * RotZ, // 1 + 0 + 18 = 19
  RotYZX = RotY + 3 * RotZ + 9 * RotX, // 1 + 6 + 0 = 7
  RotYZY = RotY + 3 * RotZ + 9 * RotY, // 1 + 6 + 9 = 16
  RotZXY = RotZ + 3 * RotX + 9 * RotY, // 2 + 0 + 9 = 11
  RotZXZ = RotZ + 3 * RotX + 9 * RotZ, // 2 + 0 + 18 = 20
  RotZYX = RotZ + 3 * RotY + 9 * RotX, // 2 + 3 + 0 = 5
  RotZYZ = RotZ + 3 * RotY + 9 * RotZ, // 2 + 3 + 18 = 23
  RotHPR = RotZXY, // used in OpenGL Performer
  RotABC = RotZYX // used in German engineering
};

/* decomposes the combined EULER angle type into the corresponding
 * individual EULER angle axis types.
 */
inline void decompose(
  EulerAngle type, RotAxis &axis1, RotAxis &axis2, RotAxis &axis3)
{
  unsigned type_ = (unsigned)type;
  axis1 = (RotAxis)(type_ % 3); type_ /= 3;
  axis2 = (RotAxis)(type_ % 3); type_ /= 3;
  axis3 = (RotAxis)type_;
}

template <typename VALUE>
Mat4x4T<VALUE> makeEuler(
  EulerAngle mode, VALUE rot1, VALUE rot2, VALUE rot3)
{
  RotAxis axis1, axis2, axis3;
  decompose(mode, axis1, axis2, axis3);
  const static VALUE axes[3][3] = {
    { (VALUE)1, (VALUE)0, (VALUE)0 },
    { (VALUE)0, (VALUE)1, (VALUE)0 },
    { (VALUE)0, (VALUE)0, (VALUE)1 }
  };
  return
      Mat4x4T<VALUE>(InitRot,
        Vec3T<VALUE>(axes[axis1][0], axes[axis1][1], axes[axis1][2]),
        rot1)
    * Mat4x4T<VALUE>(InitRot,
        Vec3T<VALUE>(axes[axis2][0], axes[axis2][1], axes[axis2][2]),
        rot2)
    * Mat4x4T<VALUE>(InitRot,
        Vec3T<VALUE>(axes[axis3][0], axes[axis3][1], axes[axis3][2]),
        rot3);
}

/* decomposes a rotation matrix into EULER angles.
 *
 * It is necessary that the upper left 3x3 matrix is composed of rotations
 * only. Translational parts are not considered.
 * Other transformations (e.g. scaling, shearing, projection) may cause
 * wrong results.
 */
template <typename VALUE>
void decompose(
  const Mat4x4T<VALUE> &mat,
  RotAxis axis1, RotAxis axis2, RotAxis axis3,
  VALUE &angle1, VALUE &angle2, VALUE &angle3)
{
  assert(axis1 != axis2 && axis2 != axis3);
  /* This is ported from EulerAngles.h of the Eigen library. */
  const int odd = (axis1 + 1) % 3 == axis2 ? 0 : 1;
  const int i = axis1;
  const int j = (axis1 + 1 + odd) % 3;
  const int k = (axis1 + 2 - odd) % 3;
  if (axis1 == axis3) {
    angle1 = atan2(mat[j][i], mat[k][i]);
    if ((odd && angle1 < (VALUE)0) || (!odd && angle1 > (VALUE)0)) {
      angle1 = angle1 > (VALUE)0 ? angle1 - (VALUE)Pi : angle1 + (VALUE)Pi;
      const VALUE s2 = length(Vec2T<VALUE>(mat[j][i], mat[k][i]));
      angle2 = -atan2(s2, mat[i][i]);
    } else {
      const VALUE s2 = length(Vec2T<VALUE>(mat[j][i], mat[k][i]));
      angle2 = atan2(s2, mat[i][i]);
    }
    const VALUE s1 = sin(angle1);
    const VALUE c1 = cos(angle1);
    angle3 = atan2(c1 * mat[j][k] - s1 * mat[k][k],
      c1 * mat[j][j] - s1 * mat[k][j]);
  } else {
    angle1 = atan2(mat[j][k], mat[k][k]);
    const VALUE c2 = length(Vec2T<VALUE>(mat[i][i], mat[i][j]));
    if ((odd && angle1<(VALUE)0) || (!odd && angle1 > (VALUE)0)) {
      angle1 = (angle1 > (VALUE)0)
        ? angle1 - (VALUE)Pi : angle1 + (VALUE)Pi;
      angle2 = atan2(-mat[i][k], -c2);
    } else angle2 = atan2(-mat[i][k], c2);
    const VALUE s1 = sin(angle1);
    const VALUE c1 = cos(angle1);
    angle3 = atan2(s1 * mat[k][i] - c1 * mat[j][i],
      c1 * mat[j][j] - s1 * mat[k][j]);
  }
  if (!odd) {
    angle1 = -angle1; angle2 = -angle2; angle3 = -angle3;
  }
}

#endif // LIN_MATH_H

Before I added the Qt code I compiled simply with:

$ g++ -std=c++11 -o rasterizerSimple rasterizerSimple.cc

To compile with Qt, I wrote a Qt project rasterizerSimple.pro:

SOURCES = rasterizerSimple.cc

QT += widgets

Compiled and tested on cygwin64 on Windows 10:

$ qmake-qt5 rasterizerSimple.pro

$ make

$ ./rasterizerSimple
2724 2921 3015 2781 2800 2805 2859 2783 2931 2871 2902 2983 2995 2882 2940 2878 3007 2993 3066 3067 3118 3144 2849 3084 3020 3005 3030 2991 2999 3065 2941 3123 3119 2905 3135 2938

CtrlC

My laptop has an Intel i7 (and produced a bit cooler noise when I let run the program a bit longer).

That looks not that bad – an average of 3000 triangles per second. (It beats the the 130 FPS of OP if his i7 is not much slower than mine.)

May be, the FPS could be pushed a bit more by further improvements. For example, interpolation could be done using Bresenham's line algorithm though it does not appear in most inner loop. To optimize most inner loop, the filling of horizontal line should be done in a faster fashion (although I'm not quite sure how to achieve this).

To start the Qt visualization, the application has to be started with gui:

$ ./rasterizerSimple gui

This sample inspired me to a toy project No-GL 3D Renderer which can be found on github.

Wow, thats the performance I was hopping to see, good job! I tried this approach before, but I had problem with perspective correct interpolation so I used the approach from here https://www.scratchapixel.com/lessons/3d-basic-rendering/rasterization-practical-implementation/perspective-correct-interpolation-vertex-attributes Still not sure why your approach is that much faster, but I am looking forward to analyze it over the weekend, thanks for help! — Martin, Sep 27 '18 at 13:57
@Martin Optimization of most inner loop is always the best idea... ;-) (and it was actually not mine). — Scheff's Cat, Sep 27 '18 at 14:12
@Martin I had a look into your link. Consider, that I used most simple rendering, no depth buffer values (I had interpolated z values but removed them later to keep it simple), no color interpolation, no lighting. If all these things were availble, it would become definitely slower. I've still some ideas about this in mind and want to make it a toy project (but I don't know if I find enough time these days). If I've any notable outcome I drop a note. — Scheff's Cat, Sep 27 '18 at 14:23
@Martin The only thing, which could be added easily (and probably without significant performance impact) is perspective distortion - there isn't a resp. constructor in `linmath.h` but I could add this. (I've an implementation locally at hand to "borrow" from.) — Scheff's Cat, Sep 27 '18 at 14:27
Actually the if inside barycentric func is optimalization, its cheaper to return than always do additional calculations. — Martin, Sep 27 '18 at 14:45
@Martin I turned it around: It's much cheaper to consider only vertices which are inside of triangle. Hence, I interpolate along triangle sides to render each raster line only from left edge to right edge. As I missed the part with color interpolation in your code, it reduces to a simple assignment (where no further `if` is necessary). (Guess how much FPS I achieved when I tried first - with wrong transformations which resulted in only three or four pixels in visible range...) ;-) — Scheff's Cat, Sep 27 '18 at 14:50
When I remove everything extra and calculate just the barycentric weights (i need them for future interpolations such as colors, tex coords, etc...) I have about 350fps when compiling for x32. When I switch to x64 I get around 425fps. But still its just 1 fullscreen polygon and not even including any matrix transformations. — Martin, Sep 27 '18 at 14:51
@Martin And btw. [_Quake II came with OpenGL support out of the box._](https://en.wikipedia.org/wiki/Quake_II) There is a reason why per-vertex transformations and per-pixel calculations have been moved to the GPU which has meanwhile 100s of cores. (My K5100M has 1536.) If you calculate all pixels concurrently, the effort for per-pixel calculations doesn't hit you so hard anymore. So, we shouldn't be too sad with 100 FPS in a pure CPU rasterization (which actually uses precisely one core)... — Scheff's Cat, Sep 27 '18 at 15:09
@Martin I did that toy project the last days and just did an upload on github [No-GL 3D Renderer](https://github.com/scheff173/NoGL3dDemo). — Scheff's Cat, Oct 03 '18 at 16:00

software rasterization implementation speed up ideas

1 Answers1