15

I have looked at this answer, which explains how to compute the value of a specific percentile, and this answer, which explains how to compute the percentiles that correspond to each element.

  • Using the first solution, I can compute the value and scan the original array to find the index.

  • Using the second solution, I can scan the entire output array for the percentile I'm looking for.

However, both require an additional scan if I want to know the index (in the original array) that corresponds to a particular percentile (or the index of the element closest to that percentile).

Is there a more direct or built-in way to get the index which corresponds to a percentile?

Note: My array is not sorted and I want the index in the original, unsorted array.

merlin2011

6 Answers

8

It is a little convoluted, but you can get what you are after with np.argpartition. Let's take an easy array and shuffle it:

>>> a = np.arange(10)
>>> np.random.shuffle(a)
>>> a
array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])

If you want to find e.g. the index of quantile 0.25, this would correspond to the item in position idx of the sorted array:

>>> idx = 0.25 * (len(a) - 1)
>>> idx
2.25

You need to decide how to round that to an int. Say you go with the nearest integer:

>>> idx = int(idx + 0.5)
>>> idx
2

If you now call np.argpartition, this is what you get:

>>> np.argpartition(a, idx)
array([7, 5, 4, 3, 2, 1, 6, 0, 8, 9], dtype=int64)
>>> np.argpartition(a, idx)[idx]
4
>>> a[np.argpartition(a, idx)[idx]]
2

It is easy to check that these last two expressions are, respectively, the index and the value of the .25 quantile.
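For reuse, the steps above could be wrapped in a small helper. This is only a sketch: the name `quantile_index` and the round-half-up choice are assumptions for illustration, not part of NumPy. With repeated values it returns one of the tied positions, and which one is not specified.

```python
import numpy as np

def quantile_index(a, q):
    """Index into the unsorted `a` of the element at quantile q (0..1).

    Hypothetical helper, not a NumPy function.
    """
    a = np.asarray(a)
    # Position the quantile would occupy in the sorted array,
    # rounded to the nearest integer (halves round up).
    pos = int(q * (len(a) - 1) + 0.5)
    # argpartition guarantees that the element placed at `pos` is the
    # one a full sort would put there, without sorting everything.
    return int(np.argpartition(a, pos)[pos])

a = np.array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])
i = quantile_index(a, 0.25)
print(i, a[i])
```

For the shuffled array used above, this gives index 4 and value 2, matching the interactive session.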

Jaime
  • +1; FWIW, your answer would be more obviously correct if `a` wasn't a shuffle of `argpartion(a, idx)`. – Veedrac Sep 27 '14 at 17:17
  • Does this work if the values in the list repeats? `y = [0, 0, 0, 2, 2, 4, 5, 5, 9]` and `int(0.75 * (len(y) -1 ) + 0.5) == 6` and `y[np.argpartition(y, 6)[6]]` outputs 5 and `y[5]` -> 4 =( – alvas Mar 16 '17 at 02:30
5

If numpy is to be used, one can also use the built-in percentile function. From version 1.9.0 of numpy, percentile has the option "interpolation" that allows you to pick out the lower/higher/nearest percentile value. The following will work on unsorted arrays and finds the nearest percentile index:

import numpy as np
p = 70  # my desired percentile, here 70%
x = np.random.uniform(1, 10, size=1000) - 5.0  # dummy vector, values in [-4, 5)

# index of array entry nearest to percentile value
pcen = np.percentile(x, p, interpolation='nearest')
i_near = abs(x - pcen).argmin()

Most people will normally want the nearest percentile value as stated above. But just for completeness, you can also easily specify to get the entry below or above the stated percentile value:

# Value of the array entry at or just above the percentile
# (its index then follows from abs(x-pcen).argmin()):
pcen=np.percentile(x,p,interpolation='higher')

# Value of the array entry at or just below the percentile:
pcen=np.percentile(x,p,interpolation='lower')
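One caveat for current readers: in NumPy 1.22 the `interpolation` keyword was renamed to `method`, and the old name was deprecated. A hedged sketch that tries the new keyword first and falls back to the old one might look like this; `percentile_index` is a made-up name for illustration, not a NumPy function.

```python
import numpy as np

def percentile_index(x, p, kind="nearest"):
    """Index of the entry of `x` picked by a discrete percentile rule.

    `kind` is expected to be 'nearest', 'lower' or 'higher'.
    Hypothetical helper, not part of NumPy.
    """
    x = np.asarray(x)
    try:
        # NumPy >= 1.22 spells the option `method`.
        value = np.percentile(x, p, method=kind)
    except TypeError:
        # Older NumPy only knows `interpolation`.
        value = np.percentile(x, p, interpolation=kind)
    # The value is an element of x, so the closest entry is that element.
    return int(np.abs(x - value).argmin())

x = np.array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])
print(percentile_index(x, 70, "nearest"),
      percentile_index(x, 70, "lower"),
      percentile_index(x, 70, "higher"))
```

This still does one extra pass over the array for the `argmin`, so it is a convenience rather than a faster method.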

For OLD versions of numpy < v1.9.0, the interpolation option is not available, and thus the equivalent is this:

# Calculate 70th percentile:
pcen=np.percentile(x,p)
i_high=np.asarray([i-pcen if i-pcen>=0 else x.max()-pcen for i in x]).argmin()
i_low=np.asarray([i-pcen if i-pcen<=0 else x.min()-pcen for i in x]).argmax()
i_near=abs(x-pcen).argmin()

In summary:

i_high points to the array entry which is the next value equal to, or greater than, the requested percentile.

i_low points to the array entry which is the next value equal to, or smaller than, the requested percentile.

i_near points to the array entry that is closest to the percentile, and can be larger or smaller.

My results are:

pcen

2.3436832738049946

x[i_high]

2.3523077864975441

x[i_low]

2.339987054079617

x[i_near]

2.339987054079617

i_high,i_low,i_near

(876, 368, 368)

i.e. location 876 holds the closest value exceeding pcen, while location 368 holds a value that is slightly smaller than the percentile but even closer to it.
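The list comprehensions for i_high and i_low can also be written without a Python-level loop. The following is a sketch under the assumption that `np.where` with an infinite sentinel is acceptable for your data (finite floats); the function name `bracket_indices` is invented for this example.

```python
import numpy as np

def bracket_indices(x, p):
    """Return (i_high, i_low, i_near) around the p-th percentile of x.

    Hypothetical helper; uses the default linear interpolation.
    """
    x = np.asarray(x, dtype=float)
    diff = x - np.percentile(x, p)
    # Smallest non-negative difference: entries below pcen are masked with +inf.
    i_high = int(np.where(diff >= 0, diff, np.inf).argmin())
    # Largest non-positive difference: entries above pcen are masked with -inf.
    i_low = int(np.where(diff <= 0, diff, -np.inf).argmax())
    # Closest entry on either side.
    i_near = int(np.abs(diff).argmin())
    return i_high, i_low, i_near

x = np.array([5, 6, 4, 9, 2, 1, 3, 0, 7, 8])
print(bracket_indices(x, 25))
```

Here the 25th percentile is 2.25, so the three indices point at the values 3, 2 and 2 respectively.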

Adrian Tompkins
  • Regarding the solution `i_near=abs(x-np.percentile(x,p,interpolation='nearest')).argmin()` it is much faster to do `y=np.percentile(x,p,interpolation='nearest') i_near=abs(x-y).argmin()` and even a little bit faster to do `y=np.percentile(x,p,interpolation='nearest') i_near=np.where(x==A).argmin()` – toliveira Apr 30 '18 at 09:40
  • thanks you are right, I will update to include this – Adrian Tompkins Oct 16 '20 at 12:16
3

You can use numpy's np.percentile like this:

import random
import numpy as np

percentile = 75
mylist = [random.random() for i in range(100)] # random list

# 'nearest' returns a value that is actually in the list, so index() can find it
percidx = mylist.index(np.percentile(mylist, percentile, interpolation='nearest'))
runDOSrun
2

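Using numpy,

import numpy as np

arr = [12, 19, 11, 28, 10]
p = 0.75
np.argsort(arr)[int((len(arr) - 1) * p)]

This returns 1, the index of 19 in the original list, which is the 75th-percentile value. Note that int() truncates, so for positions that are not whole numbers this picks the lower neighbour rather than the nearest one.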
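A quick way to check the argsort approach is to compare the selected element with what np.percentile reports for the same array. This is only a sketch; rounding the position with `round` instead of truncating is an assumption made here, and the two agree for this particular input.

```python
import numpy as np

arr = np.array([12, 19, 11, 28, 10])
p = 0.75

# argsort lists the original indices in ascending order of value.
order = np.argsort(arr)
# Position of the quantile in the sorted order, rounded to nearest.
pos = int(round((len(arr) - 1) * p))
idx = int(order[pos])

print(idx, arr[idx], np.percentile(arr, 100 * p))
```

The full sort costs O(n log n), whereas np.argpartition only needs the single position to be in place.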

sharma0611
1

Assuming the array is sorted... Unless I'm misunderstanding you, you can compute the index of a percentile by taking the length of the array minus 1, multiplying it by the quantile, and rounding to the nearest integer.

round( (len(array) - 1) * (percentile / 100.) )

should give you the nearest index to that percentile.

Gregory Nisbet
  • My array is not sorted and I want the index in the original array. I updated the question to clarify. – merlin2011 Sep 27 '14 at 02:09
  • Would sorting the array, finding the element at the index nearest the `quantile * (length - 1)` and then finding its index in the original array solve the problem? – Gregory Nisbet Sep 27 '14 at 02:10
  • Finding the index in the original array by a linear search would amount to doing one of the two solutions I already listed in the question. :) – merlin2011 Sep 27 '14 at 02:12
  • Okay, you could zip the original elements with their index `enumerate`, sort by the second element, and then take the quantile * last element. If the original array is unsorted, it's not clear to me that you can avoid doing at least O(n*log(n)) work – Gregory Nisbet Sep 27 '14 at 02:15
  • I've tested this a bit and, rather than `round( (len(array) - 1) * (percentile / 100.) )` wouldn't the correct formula be: `round( len(array) * (percentile / 100.) ) - 1` ? Basically removing 1 from the index at the end instead of from the length. – guival Aug 25 '16 at 10:49
1

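You can select the values of a DataFrame column that fall in a designated quantile with df.quantile(). The boolean mask has to be built from the column itself, not from the whole df:

df_metric_95th_percentile = df.metric[df.metric >= df['metric'].quantile(q=0.95)]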
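To connect this back to the original question (the index rather than the values), here is a minimal pandas sketch. The DataFrame contents and the column name `metric` are made up for illustration, and it assumes `Series.quantile` is called with `interpolation='nearest'` so that the quantile is an actual element of the column.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"metric": [5, 6, 4, 9, 2, 1, 3, 0, 7, 8]})

# Rows at or above the 95th percentile; their index labels survive the filter.
threshold = df["metric"].quantile(q=0.95)
top = df.loc[df["metric"] >= threshold, "metric"]

# Index label of the row holding the 0.25 quantile.
q25 = df["metric"].quantile(q=0.25, interpolation="nearest")
label = (df["metric"] - q25).abs().idxmin()

print(top.index.tolist(), label)
```

Because pandas keeps the original index labels through filtering, `top.index` already gives the positions in the unsorted data.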
metaditch