8

I have 2 arrays in 2D, where the column vectors are feature vectors. One array is of size F x A, the other of F x B, where A << B. As an example, for A = 2 and F = 3 (B can be anything):

arr1 = np.array( [[1, 4],
                  [2, 5],
                  [3, 6]] )

arr2 = np.array( [[1, 4, 7, 10, ..],
                  [2, 5, 8, 11, ..],
                  [3, 6, 9, 12, ..]] )

I want to calculate the distance between arr1 and a fragment of arr2 that is of equal size (in this case, 3x2), for each possible fragment of arr2. The column vectors are independent of each other, so I believe I should calculate the distance between each column vector in arr1 and a collection of column vectors ranging from i to i + A from arr2 and take the sum of these distances (not sure though).

Does numpy offer an efficient way of doing this, or will I have to take slices from the second array and, using another loop, calculate the distance between each column vector in arr1 and the corresponding column vector in the slice?

Example for clarity, using the arrays stated above:

>>> magical_distance_func(arr1, arr2[:,:2])
[0, 10.3923..]
>>> # First, distance between arr2[:,:2] and arr1, which equals 0.
>>> # Second, distance between arr2[:,1:3] and arr1, which equals 
>>> diff = arr1 - np.array( [[4,7],[5,8],[6,9]] )
>>> diff
[[-3, -3], [-3, -3], [-3, -3]]
>>> # this happens to consist only of -3's. Norm of each column vector is:
>>> norm1 = np.linalg.norm([:,0])
>>> norm2 = np.linalg.norm([:,1])
>>> # would be extremely good if this worked for an arbitrary number of norms
>>> totaldist = norm1 + norm2
>>> totaldist
10.3923...

Of course, transposing the arrays is fine too, if that means that cdist can somehow be used here.

  • Interesting question, +1. May I ask what the relation between the two feature sets is? If there's no general solution, maybe a domain-specific solution can be found. – Fred Foo Jun 19 '12 at 15:14
  • The elements in the arrays indicate presence (or counts, if you will) of spatial features in an image. I am trying to find the closest match, so I guess it can be seen as a classification task. `arr1` is a short sequence of, in this case, 2 timesteps, which is compared to a document of B timesteps to find the index of the closest matching sequence in it. –  Jun 19 '12 at 17:42

3 Answers3

5

If I understand your question correctly, this will work. Knowing numpy, there's probably a better way, but this is at least fairly straightforward. I used some contrived coordinates to show that the calculation is working as expected.

>>> arr1
array([[0, 3],
       [1, 4],
       [2, 5]])
>>> arr2
array([[ 3,  6,  5,  8],
       [ 5,  8, 13, 16],
       [ 2,  5,  2,  5]])

You can subtract arr1 from arr2 by ensuring that they broadcast against each other correctly. The best way I could think of involves taking a transpose and doing some reshaping. These don't create copies -- they create views -- so this isn't so wasteful. (dist is a copy though.)

>>> dist = (arr2.T.reshape((2, 2, 3)) - arr1.T).reshape((4, 3))
>>> dist
array([[ 3,  4,  0],
       [ 3,  4,  0],
       [ 5, 12,  0],
       [ 5, 12,  0]])

Now all we have to do is apply numpy.linalg.norm across axis 1. (You can select from among several norms).

>>> numpy.apply_along_axis(numpy.linalg.norm, 1, dist)
array([  5.,   5.,  13.,  13.])

Assuming you want simple euclidean distance, you can also do it directly; not sure whether this will be faster or slower so try both:

>>> (dist ** 2).sum(axis=1) ** 0.5
array([  5.,   5.,  13.,  13.])

Based on your edit, we have to do only one small tweak. Since you want to test the columns pairwise, rather than blockwise, you need a rolling window. This can be done very simply with fairly straightforward indexing:

>>> arr2.T[numpy.array(zip(range(0, 3), range(1, 4)))]
array([[[ 3,  5,  2],
        [ 6,  8,  5]],

       [[ 6,  8,  5],
        [ 5, 13,  2]],

       [[ 5, 13,  2],
        [ 8, 16,  5]]])

Combining that with the other tricks:

>>> arr2_pairs = arr2.T[numpy.array(zip(range(0, 3), range(1, 4)))]
>>> dist = arr2_pairs - arr1.T
>>> (dist ** 2).sum(axis=2) ** 0.5
array([[  5.        ,   5.        ],
       [  9.69535971,   9.69535971],
       [ 13.        ,  13.        ]])

However, converting arrays from list comprehensions tends to be slow. It might be faster to use stride_tricks -- here again, see which one suits your purposes best:

>>> as_strided(arr2.T, strides=(8, 8, 32), shape=(3, 2, 3))
array([[[ 3,  5,  2],
        [ 6,  8,  5]],

       [[ 6,  8,  5],
        [ 5, 13,  2]],

       [[ 5, 13,  2],
        [ 8, 16,  5]]])

This actually manipulates the way numpy moves over a block of memory, allowing a small array to emulate a bigger array.

>>> arr2_pairs = as_strided(arr2.T, strides=(8, 8, 32), shape=(3, 2, 3))
>>> dist = arr2_pairs - arr1.T
>>> (dist ** 2).sum(axis=2) ** 0.5
array([[  5.        ,   5.        ],
       [  9.69535971,   9.69535971],
       [ 13.        ,  13.        ]])

So now you have a simple 2-d array corresponding to distances for each pair of columns. Now it's just a matter of getting the mean and calling argmin.

>>> normed = (dist ** 2).sum(axis=2) ** 0.5
>>> normed.mean(axis=1)
array([  5.        ,   9.69535971,  13.        ])
>>> min_window = normed.mean(axis=1).argmin()
>>> arr2[:,[min_window, min_window + 1]]
array([[3, 6],
       [5, 8],
       [2, 5]])
senderle
  • 125,265
  • 32
  • 201
  • 223
  • It's not exactly what I'm looking for, but it is amazing what you did by reshaping and I may need this in the near future, +1 to you. My apologies for not being as clear as I should be. The output should consist of only 3 values for the example arrays you give, as I'm looking for a "best match" given arr1 and each combination of the same size in arr2, i.e. which index (indices) in `arr2` makes it so that `dist( arr2[i:i+2], arr1 )` is the smallest? –  Jun 19 '12 at 18:02
  • Wow. So many functions I've never heard of, and probably would have found only after meticulous scanning of documentation. Thanks a lot! –  Jun 19 '12 at 20:31
2

You can get the distance matrix using cdist from scipy.spatial.distance. Once you have the distance matrix, you can just sum across columns and normalize to get the average distance, if that's what you're looking for.

Note: Instead of columns, cdist uses rows to compute the pairwise distances.

Here you have an example using the 'cosine' distance:

from scipy.spatial.distance import cdist

arr1 = np.array( [[1, 7],
                 [4, 8],
                 [4, 0]] )

arr2 = array( [[1, 9, 3, 6, 2],
              [3, 9, 0, 2, 3],
              [6, 0, 2, 7, 4]] )

# distance matrix
D = cdist( arr1.transpose(), arr2.transpose(), 'cosine' )

# average distance array (each position corresponds to each column of arr1)
d1 = D.mean( axis=1 )

# average distance array (each position corresponds to each column of arr2)
d2 = D.mean( axis=0 )

# Results
d1 = array([ 0.23180963,  0.35643282])
d2 = array([ 0.31018485,  0.19337869,  0.46050302,  0.3233269 ,  0.18321265])

There are many distances available. Check out the documentation.

Oriol Nieto
  • 4,870
  • 5
  • 27
  • 37
  • Thanks for the example, but I think it's not exactly what I'm looking for. What I'm looking for is something like this: `arr1 = [[1,2], [1,2]], arr2 = [[1,2],[1,2],[1,3]]` gives `[0, 1]`: 0 because the the first fragment `[[1,2],[1,2]]` of arr2 is equal to arr1, and 1 because the euclidean distance between `[1,2]` and `[1,2]` is 0 + distance between `[1,2]` and `[1,3]`, which is 1. –  Jun 19 '12 at 17:29
  • If you put these values into my example, and you use the 'euclidean' distance instead of 'cosine', you get the following: D = [[0,0,1],[0,0,1]]. Maybe you can use this result for your purpose? – Oriol Nieto Jun 19 '12 at 19:00
1

scipy.spatial.distance.cdist ?

reptilicus
  • 9,765
  • 5
  • 49
  • 71
  • I believe that calculates the euclidean distance between two arrays where each column in arr1 is compared to every column in arr2. –  Jun 19 '12 at 14:49