I'd suggest not using the optimize=True
flag here because, counter-intuitively, it is slower for this case (the path-optimization step adds overhead that isn't paid back at this problem size). Also, I'd recommend explicitly promoting the 2D array to 3D, performing batch matrix multiplication, and then squeezing the singleton dimension of the resultant array if you need a 2D array as the final result. Please find the code below:
# sample arrays
In [25]: v1 = np.random.random_sample((3000, 3))
In [26]: v2 = np.random.random_sample((3, 2, 3000))
# Divakar's approach
In [27]: %timeit np.einsum('ij,jki->ik',v1,v2, optimize=True)
80.7 µs ± 792 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# needed for future use
In [28]: res_optimized = np.einsum('ij,jki->ik',v1,v2, optimize=True)
# promoting to 3D array and swapping axes
In [29]: v1 = v1[:, np.newaxis, :]
In [30]: v2 = np.moveaxis(v2, 2, 0)
# perform batch matrix multiplication
In [31]: %timeit np.einsum("bij, bjk -> bik", v1, v2)
47.9 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# for sanity checking
In [32]: res = np.einsum("bij, bjk -> bik", v1, v2)
In [33]: res.shape, res_optimized.shape
Out[33]: ((3000, 1, 2), (3000, 2))
# squeeze the singleton dimension and perform sanity check with Divakar's approach
In [34]: np.allclose(res.squeeze(), res_optimized)
Out[34]: True
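Since the reshaped einsum call is just an ordinary batch matrix multiplication, the same result can also be obtained with the @ operator (np.matmul), which broadcasts over the leading batch dimension. Here is a minimal self-contained sketch of the whole approach, using the same sample shapes as above:

```python
import numpy as np

# same sample shapes as above
v1 = np.random.random_sample((3000, 3))
v2 = np.random.random_sample((3, 2, 3000))

# reference result via einsum
res_einsum = np.einsum('ij,jki->ik', v1, v2)

# promote to 3D and move the batch axis to the front
v1_3d = v1[:, np.newaxis, :]            # (3000, 1, 3)
v2_3d = np.moveaxis(v2, 2, 0)           # (3000, 3, 2)

# batch matrix multiplication via the @ operator, then squeeze
res_matmul = (v1_3d @ v2_3d).squeeze()  # (3000, 2)

print(np.allclose(res_einsum, res_matmul))  # True
```

Whether @ or the explicit "bij, bjk -> bik" einsum is faster can vary with array sizes and NumPy version, so it is worth timing both on your own data.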
So, as we can see from the above timings, we gain an approx. 2x speedup by not using the optimize=True
flag. Also, explicitly reshaping the arrays to 3D gives a bit more insight into what's going on when we use numpy.einsum()
.
Note: timings were performed using NumPy version 1.16.1.
P.S. Read more at Understanding NumPy's einsum