12

Following some online research (1, 2, numpy, scipy, scikit, math), I have found several ways for calculating the Euclidean Distance in Python:

# 1
numpy.linalg.norm(a-b)

# 2
distance.euclidean(vector1, vector2)

# 3
sklearn.metrics.pairwise.euclidean_distances  

# 4
math.sqrt((xa - xb)**2 + (ya - yb)**2 + (za - zb)**2)

# 5
dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
dist = math.sqrt(sum(dist))

# 6
math.hypot(x, y)

I was wondering if someone could provide some insight into which of the above (or any other that I have not found) is considered the best in terms of efficiency and precision. If someone is aware of any resources that discuss the subject, that would also be great.

The context I am interested in is calculating the Euclidean Distance between pairs of number-tuples, e.g. the distance between (52, 106, 35, 12) and (33, 153, 75, 10).

Community
  • 2
    Don't forget the built-in [`math.hypot()`](https://docs.python.org/3/library/math.html#math.hypot). You can easily test the speeds using the [`timeit`](https://docs.python.org/3/library/timeit.html#module-timeit) module. – martineau Jun 13 '16 at 17:02
  • 1
    @martineau Great suggestion, had no idea that such a built-in method existed! (edited my question to include it) –  Jun 13 '16 at 18:51
  • Possible caveat with `math.hypot()` is that it only does 2D vectors, whereas many of the others you mention can handle vectors of 3 or more dimensions. On the other hand, if all you're doing is 2D, the non-generalized built-in might have a speed advantage. – martineau Jun 13 '16 at 19:25
  • @martineau Interesting caveat although for my case it may be ideal. Possibly naive question: when calculating the euclidean distance between `(52, 106, 35, 12)` and `(33, 153, 75, 10)`, these two are 4D vectors?? –  Jun 13 '16 at 19:29
  • All depends on how you're interpreting them in the program. Could be two 4D vectors or four 2D vectors...the former seems most likely — I can't tell from your sample code. – martineau Jun 13 '16 at 19:38
  • @martineau Well, all I am interested in is comparing the euclidean distance of the `x1` with `y1`, `x2` with `y2` etc., where `(x1, x2, x3, x4)` and `(y1, y2, y3, y4)`. And I may have more than 4 values in the tuples. Could you please help me understand what kind of dimension vectors I need? –  Jun 13 '16 at 19:49
  • Good news, what you described are three 2D vectors between 4 points, which `math.hypot()` can handle fine. – martineau Jun 13 '16 at 22:37
  • @martineau Don't you mean two 2D vectors between 4 points? –  Jun 14 '16 at 09:33
  • No, I meant the three 2D vectors defined by the pairs of endpoints between (52,33) and (106,153), (106,153) and (35,75), plus (35,75) and (12,10). Perhaps you should edit your question and show the desired results. – martineau Jun 14 '16 at 14:54
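On the 2D caveat raised in the comments: since Python 3.8 the standard library also provides `math.dist()`, and `math.hypot()` accepts any number of coordinates. The following is a minimal sketch, assuming Python 3.8 or later and assuming the two tuples are meant as two 4D points; the variable names are only illustrative.

```python
import math

p = (52, 106, 35, 12)
q = (33, 153, 75, 10)

# math.dist (Python 3.8+) takes two points with the same number of coordinates.
d_dist = math.dist(p, q)

# From Python 3.8 on, math.hypot accepts any number of arguments,
# so it can be fed the per-component differences directly.
d_hypot = math.hypot(*(a - b for a, b in zip(p, q)))

print(d_dist, d_hypot)
```

On older interpreters `math.hypot()` takes exactly two arguments, so this sketch does not apply there.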

5 Answers

16

Conclusion first:

Based on the timeit results below, the methods rank as follows for efficiency (fastest first):

Method5 (zip, math.sqrt) > Method1 (numpy.linalg.norm) > Method2 (scipy.spatial.distance) > Method3 (sklearn.metrics.pairwise.euclidean_distances)

I didn't test your Method4, as it is not suitable for the general case and is essentially equivalent to Method5.

For the rest, quite surprisingly, Method5 is the fastest. Method1, which uses numpy and is heavily optimized in C, is the second fastest, as expected.

For scipy.spatial.distance, if you go directly to the function definition, you will see that it actually uses numpy.linalg.norm, except that it validates the two input vectors before calling numpy.linalg.norm. That's why it is slightly slower than numpy.linalg.norm.

Finally for sklearn, according to the documentation:

This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then dot(x, x) and/or dot(y, y) can be pre-computed. However, this is not the most precise way of doing this computation, and the distance matrix returned by this function may not be exactly symmetric as required

Since your question uses a small, fixed set of data, the advantages of this implementation do not come into play. And because it trades precision for performance, it also gives the worst precision of all the methods.

Regarding precision: Method5 = Method1 = Method2 > Method3
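To illustrate the precision remark, here is a minimal NumPy sketch of the expanded form sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)) that the quoted documentation refers to, next to the direct difference-based form. It is not scikit-learn's actual code; `expanded_distance` and `direct_distance` are hypothetical names used only for this illustration.

```python
import numpy as np

def expanded_distance(x, y):
    """Distance via dot products: sqrt(x.x - 2 x.y + y.y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    squared = np.dot(x, x) - 2.0 * np.dot(x, y) + np.dot(y, y)
    # Rounding can push the result slightly below zero, so clamp it.
    return float(np.sqrt(max(squared, 0.0)))

def direct_distance(x, y):
    """Distance via the component-wise differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

# On the small values from the question both forms agree closely.
small_x, small_y = (52, 106, 35, 12), (33, 153, 75, 10)
print(expanded_distance(small_x, small_y), direct_distance(small_x, small_y))

# With large, nearly equal coordinates the expanded form subtracts huge,
# almost identical numbers and can lose the small difference entirely.
big_x, big_y = (1e8, 0.0), (1e8 + 1.0, 0.0)
print(expanded_distance(big_x, big_y), direct_distance(big_x, big_y))
```

The second pair is a contrived case chosen to make the cancellation visible; for data like the tuples in the question the difference is negligible.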

Efficiency Test Script:

import numpy as np
from scipy.spatial import distance
from sklearn.metrics.pairwise import euclidean_distances
import math

# 1
def eudis1(v1, v2):
    return np.linalg.norm(v1-v2)

# 2
def eudis2(v1, v2):
    return distance.euclidean(v1, v2)

# 3
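def eudis3(v1, v2):
    # Current scikit-learn expects 2D inputs of shape (n_samples, n_features)
    return euclidean_distances(v1.reshape(1, -1), v2.reshape(1, -1))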

# 5
def eudis5(v1, v2):
    dist = [(a - b)**2 for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(dist))
    return dist

dis1 = (52, 106, 35, 12)
dis2 = (33, 153, 75, 10)
v1, v2 = np.array(dis1), np.array(dis2)

import timeit

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

wrappered1 = wrapper(eudis1, v1, v2)
wrappered2 = wrapper(eudis2, v1, v2)
wrappered3 = wrapper(eudis3, v1, v2)
wrappered5 = wrapper(eudis5, v1, v2)
t1 = timeit.repeat(wrappered1, repeat=3, number=100000)
t2 = timeit.repeat(wrappered2, repeat=3, number=100000)
t3 = timeit.repeat(wrappered3, repeat=3, number=100000)
t5 = timeit.repeat(wrappered5, repeat=3, number=100000)

print('\n')
print('t1: ', sum(t1)/len(t1))
print('t2: ', sum(t2)/len(t2))
print('t3: ', sum(t3)/len(t3))
print('t5: ', sum(t5)/len(t5))

Efficiency Test Output:

t1:  0.654838958307
t2:  1.53977598714
t3:  6.7898791732
t5:  0.422228400305
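The timings above were measured on 4-element vectors only. A minimal sketch for checking how the pure-Python and numpy variants compare as the vectors grow is shown below; the sizes, repeat counts, and function names are arbitrary choices for illustration, and the numbers will vary by machine and library version.

```python
import math
import timeit

import numpy as np

def dist_python(v1, v2):
    # Same approach as Method5: zip, square, sum, math.sqrt
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def dist_numpy(v1, v2):
    # Same approach as Method1
    return float(np.linalg.norm(v1 - v2))

rng = np.random.default_rng(0)
for n in (4, 128, 1024):
    v1, v2 = rng.random(n), rng.random(n)
    # Take the best of three runs to reduce noise from other processes.
    t_py = min(timeit.repeat(lambda: dist_python(v1, v2), repeat=3, number=200))
    t_np = min(timeit.repeat(lambda: dist_numpy(v1, v2), repeat=3, number=200))
    print(f"n={n:5d}  python: {t_py:.5f}s  numpy: {t_np:.5f}s")
```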

Precision Test Script & Result:

In [8]: eudis1(v1,v2)
Out[8]: 64.60650122085238

In [9]: eudis2(v1,v2)
Out[9]: 64.60650122085238

In [10]: eudis3(v1,v2)
Out[10]: array([[ 64.60650122]])

In [11]: eudis5(v1,v2)
Out[11]: 64.60650122085238
MaThMaX
  • 1
    Please add the built-in [`math.hypot()`](https://docs.python.org/3/library/math.html#math.hypot). (The OP is using Python 3, BTW). – martineau Jun 13 '16 at 17:07
  • @MaThMaX Great stuff! As @martineau suggests, if you could add the built-in `math.hypot()` that would be amazing. Especially since I have never used/heard of it before. –  Jun 13 '16 at 18:48
  • For small vectors, the efficiency ranking is Method5 (zip, math.sqrt) > Method1 (numpy.linalg.norm). However, when I tested vectors with more than 128 elements, Method1 > Method5 – RyanLiu Apr 22 '20 at 04:18
  • With respect to `sklearn` and the documentation: The computational advantage only shows up for larger distance matrices. The benchmark essentially tests a single distance value, but what if you have 1000's of points and want to compute pairwise distances between all of them and store the result in a matrix? This is the scenario when `sklearn` becomes superior (at the loss of precision) - as indicated in the docs. – no_use123 May 04 '20 at 16:02
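The many-points scenario mentioned in the comments can be sketched with plain numpy broadcasting, as below. This is only an illustration with made-up sample points (the first two rows are the tuples from the question); for large inputs, dedicated routines such as `scipy.spatial.distance.cdist` or scikit-learn's `euclidean_distances` are the usual choice.

```python
import numpy as np

# Three sample points; the third row is an arbitrary extra point.
points = np.array([
    [52, 106, 35, 12],
    [33, 153, 75, 10],
    [0, 0, 0, 0],
], dtype=float)

# diff[i, j] holds points[i] - points[j]; broadcasting gives shape (n, n, d).
diff = points[:, None, :] - points[None, :, :]

# Sum the squared differences over the last axis, then take the square root.
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

print(dist_matrix.shape)
print(dist_matrix[0, 1])
```

Note that the intermediate `diff` array needs n * n * d values, so this direct approach becomes memory-hungry for thousands of points.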
3

This is not exactly answering the question, but it is probably worth mentioning that if you aren't interested in the actual euclidean distance, but just want to compare euclidean distances against each other, square roots are monotone functions, i.e. x**(1/2) < y**(1/2) if and only if x < y.

So if you don't need the explicit distance, but for instance just want to know which vector in a list, called vectorlist, is closest to vector1, you can avoid the square root (expensive in terms of both precision and time) and make do with something like

min(vectorlist, key=lambda compare: sum((a - b)**2 for a, b in zip(vector1, compare)))
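As a runnable illustration of this idea, the sketch below picks the nearest vector by squared distance and checks that the choice matches the one made with the true distance. The candidate vectors and the helper name `squared_distance` are made up for the example.

```python
import math

def squared_distance(v1, v2):
    # Sum of squared component differences; no square root taken.
    return sum((a - b) ** 2 for a, b in zip(v1, v2))

vector1 = (52, 106, 35, 12)
vectorlist = [(33, 153, 75, 10), (50, 100, 30, 10), (0, 0, 0, 0)]

# Ranking by squared distance gives the same winner as ranking by distance,
# because the square root is monotonically increasing.
nearest = min(vectorlist, key=lambda compare: squared_distance(vector1, compare))
nearest_with_sqrt = min(
    vectorlist, key=lambda compare: math.sqrt(squared_distance(vector1, compare))
)

print(nearest, nearest == nearest_with_sqrt)
```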

1

As a general rule of thumb, stick to the scipy and numpy implementations where possible, as they're vectorized and much faster than native Python code. (The main reasons: they are implemented in C, and vectorization eliminates the per-element type-checking overhead that looping incurs.)

(Aside: My answer doesn't cover precision here, but I think the same principle applies for precision as for efficiency.)

As a bit of a bonus, I'll chip in with a bit of information on how you can profile your code, to measure efficiency. If you're using the IPython interpreter, the secret is to use the %prun line magic.

In [1]: import numpy

In [2]: from scipy.spatial import distance

In [3]: c1 = numpy.array((52, 106, 35, 12))

In [4]: c2 = numpy.array((33, 153, 75, 10))

In [5]: %prun distance.euclidean(c1, c2)
         35 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 linalg.py:1976(norm)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        6    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        4    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
        1    0.000    0.000    0.000    0.000 distance.py:232(euclidean)
        2    0.000    0.000    0.000    0.000 distance.py:152(_validate_vector)
        2    0.000    0.000    0.000    0.000 shape_base.py:9(atleast_1d)
        1    0.000    0.000    0.000    0.000 misc.py:11(norm)
        1    0.000    0.000    0.000    0.000 function_base.py:605(asarray_chkfinite)
        2    0.000    0.000    0.000    0.000 numeric.py:476(asanyarray)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'squeeze' of 'numpy.ndarray' objects}


In [6]: %prun numpy.linalg.norm(c1 - c2)
         10 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 linalg.py:1976(norm)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

What %prun does is tell you how long a function call takes to run, including a bit of trace to figure out where the bottleneck might be. In this case, both the scipy.spatial.distance.euclidean and numpy.linalg.norm implementations are pretty fast. Assuming you defined a function dist(vect1, vect2), you can profile using the same IPython magic call. As another added bonus, %prun also works inside the Jupyter notebook, and you can do %%prun to profile an entire cell of code, rather than just one function, simply by making %%prun the first line of that cell.
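Outside IPython, a similar report can be produced with the standard library's `cProfile` and `pstats` modules. The sketch below is a minimal example under that assumption; `dist` is a stand-in for whatever distance function you have defined, and the loop count is arbitrary.

```python
import cProfile
import io
import math
import pstats

def dist(vect1, vect2):
    # Placeholder distance function to be profiled.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vect1, vect2)))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10000):
    dist((52, 106, 35, 12), (33, 153, 75, 10))
profiler.disable()

# Collect the report in a string, sorted by internal time, top 5 entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

Repeating the call many times makes the per-function times large enough to read, which a single call (as in the traces above) does not.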

ericmjl
0

I don't know how the precision and speed compare to the other libraries you mentioned, but you can do it for 2D vectors using the built-in math.hypot() function:

from math import hypot

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = iter(iterable), iter(iterable)
    next(b, None)
    return zip(a, b)

a = (52, 106, 35, 12)
b = (33, 153, 75, 10)

dist = [hypot(p2[0]-p1[0], p2[1]-p1[1]) for p1, p2 in pairwise(tuple(zip(a, b)))]
print(dist)  # -> [131.59027319676787, 105.47511554864494, 68.94925670375281]
martineau
  • Thanks for this, I will try to test and time it. Could you briefly explain what the `pairwise` method does? –  Jun 14 '16 at 14:55
  • 1
    Sure. The `pairwise()` function is a slight variation on the one shown in the [itertools recipes](https://docs.python.org/3/library/itertools.html#itertools-recipes) documentation. Both it and the original return consecutive pairs of values from the iterable they are passed, in the order shown in the docstring at the very beginning of the function. – martineau Jun 14 '16 at 15:10
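To make the behaviour of `pairwise()` concrete, this short sketch prints the consecutive point pairs it yields for the sample data in the answer. It reuses the answer's function unchanged; the names `points` and `pairs` are only for illustration.

```python
def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = iter(iterable), iter(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)

# zip(a, b) turns the two tuples into four 2D points.
points = tuple(zip((52, 106, 35, 12), (33, 153, 75, 10)))

# Each pair holds two neighbouring points, i.e. the endpoints of one segment.
pairs = list(pairwise(points))
for p1, p2 in pairs:
    print(p1, "->", p2)
```

This variant needs an input that can be iterated twice, such as a tuple or list, which is why the answer wraps `zip(a, b)` in `tuple()`. From Python 3.10 on, `itertools.pairwise` provides the same pairing and also works on one-shot iterators.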
0

Here is an example of how to do it using just numpy.

import numpy as np

a = np.array([3, 0])
b = np.array([0, 4])

c = np.sqrt(np.sum((a - b) ** 2))
# c == 5.0
Vlad Bezden