
Although it is known that using nested std::vector to represent matrices is a bad idea, let's use it for now since it is flexible and many existing functions can handle std::vector.

I thought that for small matrices the speed difference could be ignored, but it turned out that vector<vector<double>> is 10+ times slower than numpy.dot().

Let A and B be matrices of size `size` x `size`. We assume square matrices only for simplicity (we don't intend to limit the discussion to the square case). We initialize each matrix deterministically and finally calculate C = A * B.

We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.

Python 3 code

import numpy as np
import time
import sys

if len(sys.argv) != 2:
    print("Pass `size` as an argument.", file=sys.stderr)
    sys.exit(1)
size = int(sys.argv[1])

A = np.ndarray((size, size))
B = np.ndarray((size, size))

for i in range(size):
    for j in range(size):
        A[i][j] = i * 3.14 + j
        B[i][j] = i * 3.14 - j

start = time.time()
C = np.dot(A, B)
print("{:.3e}".format(time.time() - start), file=sys.stderr)

C++ code

#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib> // for atoi()
using namespace std;

int main(int argc, char **argv) {

    if (argc != 2) {
        cerr << "Pass `size` as an argument.\n";
        return 1;
    }
    const unsigned size = atoi(argv[1]);

    vector<vector<double>> A(size, vector<double>(size));
    vector<vector<double>> B(size, vector<double>(size));

    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            A[i][j] = i * 3.14 + j;
            B[i][j] = i * 3.14 - j;
        }
    }

    auto start = chrono::system_clock::now();

    vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            for (unsigned k = 0; k < size; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    cerr << scientific;
    cerr.precision(3);
    cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "\n";

}

C++ code (multithreaded)

We also wrote a multithreaded version of the C++ code, since numpy.dot() is automatically parallelized.

You can get all the code from GitHub.
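
The multithreaded driver itself is not reproduced here; roughly, it splits the rows of C into contiguous ranges and runs the worker function f (cited later in this post) on each range in its own thread. The following is only a minimal sketch of such a driver under those assumptions, not the exact code from the repository; the name multiply_parallel and the chunking scheme are illustrative.

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>
using namespace std;

// Worker: computes rows [row_start, row_end) of C = A * B.
// (Same signature as the functions cited later in this post.)
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end);

// Sketch: split the rows of C evenly across num_threads threads and join them.
void multiply_parallel(const vector<vector<double>> &A, const vector<vector<double>> &B,
                       vector<vector<double>> &C, unsigned num_threads) {
    const unsigned rows = A.size();
    const unsigned chunk = (rows + num_threads - 1) / num_threads;  // ceiling division

    vector<thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const unsigned row_start = t * chunk;
        const unsigned row_end = min(rows, row_start + chunk);
        if (row_start >= row_end) break;
        threads.emplace_back(f, cref(A), cref(B), ref(C), row_start, row_end);
    }
    for (auto &th : threads) th.join();
}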

Result

The C++ version is 10+ times slower than the Python 3 (NumPy) version.

matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
         C++ (1 thread): 8.66e-03
        C++ (2 threads): 4.68e-03
        C++ (3 threads): 3.14e-03
        C++ (4 threads): 2.43e-03
               Python 3: 4.07e-04
-----------------------------------------------

matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
         C++ (1 thread): 6.985e-02
        C++ (2 threads): 3.647e-02
        C++ (3 threads): 2.462e-02
        C++ (4 threads): 1.915e-02
               Python 3: 1.466e-03
-----------------------------------------------

Question

Is there any way to make the C++ implementation faster?


Optimizations I Tried

  1. Swap the loop order -> at most 3.5 times faster (relative to the original C++ code, not to the NumPy code). A sketch of this reordering is shown after this list.

  2. Optimization 1, plus partial unrolling -> at most 4.5 times faster. I originally thought this could only be done when `size` is known in advance, but as pointed out in this comment, that is not necessary: we can simply cap the loop variables of the unrolled loops and process the remaining elements with normal loops. See my implementation for an example.

  3. Optimization 2, plus minimizing the calls to C[i][j] by accumulating into a plain variable sum -> at most 5.2 times faster. The implementation is here. This result implies that std::vector::operator[] is non-negligibly slow.

  4. Optimization 3, plus the g++ -march=native flag -> at most 6.2 times faster. (By the way, we use -O3, of course.)

  5. Optimization 3, plus reducing the calls to operator[] by introducing a pointer into a row of A, since A's elements are accessed sequentially in the unrolled loop -> at most 6.2 times faster, and a tiny bit faster than Optimization 4. The code is shown below.

  6. The g++ -funroll-loops flag to unroll for loops -> no change

  7. The g++ #pragma GCC unroll n -> no change

  8. The g++ -flto flag to turn on link-time optimization -> no change

  9. Blocked (tiled) algorithm -> no change

  10. Transpose B to avoid cache misses -> no change

  11. A long flat std::vector instead of nested std::vector<std::vector>, with swapped loop order, the blocked algorithm, and partial unrolling -> at most 2.2 times faster

  12. Optimization 1, plus PGO (profile-guided optimization) -> 4.7 times faster

  13. Optimization 3, plus PGO -> same as Optimization 3

  14. Optimization 3, plus the g++-specific __builtin_prefetch() -> same as Optimization 3
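
For completeness, here is a minimal sketch of what Optimization 1 (swapping the loop order from i-j-k to i-k-j) looks like. This is only an illustration of the idea, not the exact code from the repository; C is assumed to be zero-initialized.

#include <vector>
using namespace std;

// i-k-j loop order: the innermost loop now walks C[i] and B[k] contiguously
// instead of striding down a column of B as the original i-j-k order does.
void multiply_ikj(const vector<vector<double>> &A, const vector<vector<double>> &B,
                  vector<vector<double>> &C) {
    const unsigned n = A.size();       // rows of A (and C)
    const unsigned inner = B.size();   // columns of A == rows of B
    const unsigned m = B[0].size();    // columns of B (and C)
    for (unsigned i = 0; i < n; ++i) {
        for (unsigned k = 0; k < inner; ++k) {
            const double a = A[i][k];      // loaded once per (i, k) pair
            for (unsigned j = 0; j < m; ++j) {
                C[i][j] += a * B[k][j];    // sequential access to C[i] and B[k]
            }
        }
    }
}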


Current Status

(originally) 13.06 times slower -> (currently) 2.10 times slower

Again, you can get all the code on GitHub, but let us cite some of it here; both functions below are called from the multithreaded version of the C++ code.

Original Code (GitHub)

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    for (int i = row_start; i < row_end; ++i) {
        for (int j = 0; j < j_max; ++j) {
            for (int k = 0; k < k_max; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

Current Best Code (GitHub)

This is the implementation of Optimization 5 above.

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {

    static const unsigned num_unroll = 5;

    const unsigned j_max = B[0].size();
    const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
    const unsigned k_max = B.size();

    for (int i = row_start; i < row_end; ++i) {
        for (int k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
            for (int j = 0; j < j_max; ++j) {
                const double *p = A[i].data() + k;
                double sum;
                sum = *p++ * B[k][j];
                sum += *p++ * B[k+1][j];
                sum += *p++ * B[k+2][j];
                sum += *p++ * B[k+3][j];
                sum += *p++ * B[k+4][j];
                C[i][j] += sum;
            }
        }
        for (int k = k_max_for_unrolled_loop; k < k_max; ++k) {
            const double a = A[i][k];
            for (int j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }

}

We've tried many optimizations since we first posted this question. We spent two whole days struggling with this problem and finally reached the point where we have no more ideas for optimizing the current best code. We doubt that more complex algorithms like Strassen's will do better, since the matrices we handle are not large, and each operation on std::vector is so expensive that, as we've seen, just reducing the calls to operator[] improved performance considerably.

We (want to) believe we can make it better, though.

ynn
  • Partial unrolling does not require the size to be known in advance. There may be a small leftover piece to handle with a non-unrolled loop, but that only affects a small part of the time as well: the unrolled speed is still *mostly* reached. – harold Apr 01 '20 at 17:29
  • @harold You are totally right. Let me link your comment from OP. – ynn Apr 01 '20 at 17:36
  • I have [some](https://codereview.stackexchange.com/a/213668/36018) [other](https://stackoverflow.com/a/33813176/555045) [answers](https://stackoverflow.com/a/58169377/555045) you might be interested in, but none of them is a direct duplicate. You should really get rid of the nested vectors too, though: a single vector is fine, but the double indirection of the extra layer of vectors prevents some vectorizations (funny how that goes). – harold Apr 01 '20 at 17:47
  • Did you use PGO (profile-guided optimization)? `-fprofile-generate` / run it / `-fprofile-use` to use the profile data, so it knows which loops are hot and worth unrolling, and which branches are predictable or unpredictable. – Peter Cordes Apr 02 '20 at 04:20
  • @PeterCordes I did, but forgot to append the results to the OP (since there was no improvement). Please see my edit. – ynn Apr 02 '20 at 04:45
  • Well, your matrix multiplication in C++ is just quite inefficient. It's the naive implementation. Numpy will employ a better algorithm. Take a look at https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm for some ideas about how matrix multiplication can be implemented efficiently. – cmaster - reinstate monica Apr 02 '20 at 14:39

1 Answer


Matrix multiplication is relatively easy to optimize. However, if you want to reach decent CPU utilization it becomes tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:

  1. Use SIMD instructions
  2. Use register blocking and fetch multiple data at once
  3. Optimize for your cache lines (mainly L2 and L3)
  4. Parallelize your code to use multiple threads

This link is a very good resource that explains all the nasty details: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
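
To give a rough idea of steps 2 and 3, here is a minimal sketch of cache blocking on a flat, row-major n x n layout. It is not taken from the linked resource; the block size of 64 is just a placeholder that would have to be tuned for the target CPU's caches, and C is assumed to be zero-initialized.

#include <algorithm>
#include <vector>

// Blocked (tiled) multiplication: process the matrices in BS x BS tiles so the
// working set of each tile stays in cache. Inside a tile the i-k-j order keeps
// the innermost loop contiguous, which also lets the compiler auto-vectorize it.
void matmul_blocked(const std::vector<double> &A, const std::vector<double> &B,
                    std::vector<double> &C, unsigned n) {
    const unsigned BS = 64;  // placeholder block size; tune per cache level
    for (unsigned ii = 0; ii < n; ii += BS)
        for (unsigned kk = 0; kk < n; kk += BS)
            for (unsigned jj = 0; jj < n; jj += BS)
                for (unsigned i = ii; i < std::min(ii + BS, n); ++i)
                    for (unsigned k = kk; k < std::min(kk + BS, n); ++k) {
                        const double a = A[i * n + k];
                        for (unsigned j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}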

If you want more in-depth advice, leave a comment.

OutOfBound
  • Thank you for your answer. I've seen many answers on Stack Overflow which suggest SIMD instructions, but are they available for nested `vector` matrices? I'm not confident, but I think they are useful when we have full control over how elements are arranged in memory. In my case, however, it is extremely difficult to optimize the memory structure of `vector` itself (without playing with a custom allocator). If such instructions are not compatible with `vector`, and I have to use a non-nested `vector` or built-in arrays, I would rather use BLAS or LAPACK in the first place. – ynn Apr 02 '20 at 14:39
  • If you care about performance, you should get rid of the nested vector anyway, because the additional layer of indirection is bad for your cache friendliness. – OutOfBound Apr 02 '20 at 14:42
  • Yes, you are right. But as written in the first line of the OP, the purpose of this question is to make the most of `vector`. Using `vector` is the first principle here. Otherwise my question would simply be a duplicate of many other questions already posted and answered. – ynn Apr 02 '20 at 14:45
  • You can use SIMD operations on the inner vector, but optimizing based on a bad memory layout will not yield good performance. Can you build a matrix class and copy the interface of vector to play nicely with your algorithms? – OutOfBound Apr 02 '20 at 14:50
  • Yes. It should be easy. I'll try that. Thank you for your suggestion :) – ynn Apr 02 '20 at 14:57
  • You can still use SIMD with the nested vectors: a wide SIMD load from the B matrix is still possible (elements within a row are contiguous, and only rows of B matter), and from A there will be a broadcasted load anyway, so the layout doesn't matter much (except that the extra indirection costs overhead, but it doesn't block SIMD). – harold Apr 02 '20 at 16:07