
I have been curious about this for some time. I can live with it, but it still bites me when I'm not careful, so I've decided to post it here. Consider the following example (NumPy version 1.8.2):

from numpy import *  # implied by the bare array/shape names below

a = array([[0, 1], [2, 3]])
print shape(a[0:0, :]) # (0, 2)
print shape(a[0:1, :]) # (1, 2)
print shape(a[0:2, :]) # (2, 2)
print shape(a[0:100, :]) # (2, 2)

print shape(a[0]) # (2, )
print shape(a[0, :]) # (2, )
print shape(a[:, 0]) # (2, )
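For what it's worth, the dimension drop only happens with a scalar index; a length-1 slice preserves the axis. A small sketch of the difference (using `import numpy as np` rather than the star import above):

```python
import numpy as np

a = np.array([[0, 1], [2, 3]])

# A scalar index drops the dimension; a length-1 slice keeps it.
print(a[:, 0].shape)    # (2,)   -- one-dimensional
print(a[:, 0:1].shape)  # (2, 1) -- still two-dimensional, a "column"
print(a[0:1, :].shape)  # (1, 2) -- a "row"
```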

I don't know how other people feel, but the result feels inconsistent to me. The last line looks like a column vector while the second-to-last looks like a row vector; they should have different dimensions -- in linear algebra they do! (The a[0:100, :] line is another surprise, but I will neglect it for now.) Consider a second example:

solution = scipy.sparse.linalg.dsolve.linsolve.spsolve(A, b) # solution of dimension (n, )
analytic = reshape(f(x, y), (n, 1)) # analytic of dimension (n, 1)
error = solution - analytic

Now error has dimension (n, n). Yes, in the second line I should have used (n, ) instead of (n, 1) -- but why? I used to use MATLAB a lot, where a vector always has two dimensions, (n, 1) or (1, n), functions like linspace return arrays of dimension (1, n), and there never exists (n, ). But in NumPy (n, 1) and (n, ) coexist, and there are many functions for dimension handling alone: atleast_1d/atleast_2d, newaxis, and the various uses of reshape, but to me those functions are more a source of confusion than of help. If an array prints like [1, 2, 3], then intuitively its shape should be (1, 3) instead of (3, ), right? If NumPy did not have (n, ), I can only see a gain in clarity, not a loss in functionality.
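To illustrate how the subtraction above produces an (n, n) result, here is a sketch with a small hypothetical n (broadcasting a (n,) array against a (n, 1) array yields an outer-difference matrix):

```python
import numpy as np

n = 3
solution = np.arange(n, dtype=float)                # shape (n,), like spsolve's output
analytic = np.arange(n, dtype=float).reshape(n, 1)  # shape (n, 1)

# (n,) broadcasts against (n, 1) to give an (n, n) matrix of pairwise differences.
error = solution - analytic
print(error.shape)  # (3, 3)

# Flattening analytic restores the intended elementwise subtraction.
print((solution - analytic.ravel()).shape)  # (3,)
```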

So there must be some design reason behind this. I have searched from time to time without finding a clear answer or write-up. Could someone help clarify this confusion or point me to some useful references? Your help is much appreciated.

Taozi
  • Not really a duplicate. http://stackoverflow.com/questions/22053050/difference-between-numpy-array-shape-r-1-and-r launches into a lot of low-level detail rather than addressing the high-level conceptual gotchas arising from experience with Matlab, which is the OP's problem – jez Jan 09 '15 at 18:23
  • The design reason is that `numpy` was not created to be compatible with old `MATLAB` code. `MATLAB` started with 2d matrices, and has never shaken that legacy. – hpaulj Jan 09 '15 at 22:05
  • MATLAB does not always preserve dimensions, `x(:,:,1)` returns a 2d matrix, with an automatic `squeeze` on the final singleton dimension. `x(:,1)` is handled as a special case. – hpaulj Jan 09 '15 at 22:17
  • @hpaulj Agree, I have MATLAB implanted in my mind before, that's why I had a hard time understanding Numpy's different concept. Now the clear distinction between one dimensional and two dimensional arrays in Numpy is more appealing to me thanks to jez's answer. – Taozi Jan 09 '15 at 22:19
  • @hpaulj The converse is also true in matlab, ie that you can take `x(:,:,1)` or `x(:,:,:,:,:,:,1)` even if the array is only "two-dimensional". In other words, even a 2D array is implicitly `M x N x 1 x 1 x 1 x .....` which can be a lot more graceful than the numpy way. But I agree, in the end numpy's way is more consistent as a whole. – jez Jan 10 '15 at 04:00
  • `numpy` does the same, but on the left. A `np.newaxis` is automatically added on the left if needed for broadcasting. Does MATLAB have broadcasting? I know Octave added it as an option. – hpaulj Jan 11 '15 at 07:58
  • @hpaulj Matlab eventually did add broadcasting in the form of bsxfun (http://www.mathworks.com/help/matlab/ref/bsxfun.html). It's ugly, but it works. – jez Jan 12 '15 at 19:44

1 Answer


numpy's philosophy is not that a[:, 0] is a "column vector" and a[0, :] a "row vector" in the general case. Rather, they are both, quite simply, vectors—i.e. arrays with one and only one dimension. This is actually highly logical and consistent (but yes, can get annoying for those of us accustomed to Matlab).

I say "in the general case" because that is true for numpy's most general data structure, the array, which is intended for all kinds of multi-dimensional dense data storage and manipulation applications—not just matrix math. Having "rows" and "columns" is a highly specialized context for array operations—but yes, a very common one: that's why numpy also supplies the matrix class. Convert your array to a numpy.matrix (or use the matrix constructor instead of array to begin with) and you will see behaviour closer to what you expect. For more information, see What are the differences between numpy arrays and matrices? Which one should I use?
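A quick sketch of that difference: the matrix class always keeps exactly two dimensions, so row and column slices retain distinct shapes, while the plain array drops the indexed dimension.

```python
import numpy as np

m = np.matrix([[0, 1], [2, 3]])
print(m[0, :].shape)  # (1, 2) -- rows stay two-dimensional
print(m[:, 0].shape)  # (2, 1) -- and so do columns

a = np.asarray(m)     # back to a plain ndarray
print(a[0, :].shape)  # (2,)   -- the array drops the dimension
```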

For cases where you're dealing with more than 2 dimensions, take a look at the numpy.expand_dims function. Though the syntax is annoyingly redundant and unpythonically verbose, when I'm working on arrays with more than 2 dimensions (so cannot use matrix), I'm forever having to use expand_dims to do this kind of thing:

A -= numpy.expand_dims( A.mean( axis=2 ), 2 )   # subtract mean-across-layers from A

instead of

A -= A.mean( axis=2 )   # throw an exception while naively attempting to subtract mean-across-layers from A
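(As an aside, not part of the original answer: NumPy's reduction methods also accept a `keepdims` argument, available since version 1.7, which keeps the reduced axis as a size-1 dimension so the result broadcasts against the original array directly. A sketch:)

```python
import numpy as np

A = np.arange(24, dtype=float).reshape(2, 3, 4)

# keepdims=True leaves axis 2 as a size-1 dimension, so the mean
# broadcasts against A without an explicit expand_dims.
A -= A.mean(axis=2, keepdims=True)
print(np.allclose(A.mean(axis=2), 0))  # True
```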

But consider Matlab, by contrast. Matlab implicitly asserts that there is no such thing as a one-dimensional object and that the minimum number of dimensions a thing can ever have is 2. Sure, you and I are both highly accustomed to this, but take a moment to realize how arbitrary it is. There is clearly a conceptual difference between a fundamentally one-dimensional object, and a two-dimensional object that just happens to have extent 1 in one of its dimensions: the latter is allowed to grow in its second dimension, whereas the former doesn't even know what the second dimension means—and why should it? Hence a.shape==(N,) and a.shape==(N,1) make perfect sense as separate cases. You might as well ask "why is it not (N, 1, 1)?" or "why is it not (N, 1, 1, 1, 1, 1, 1)?"

jez
  • Thank you jez. I read your answer and the links quoted. Now Numpy's convention appears more natural to me than MATLAB :-) – Taozi Jan 09 '15 at 20:22