I have a DataFrame in which all values are of the same kind (e.g. a correlation matrix -- but one where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.

I can get the max across rows or columns by changing the first argument of

df.idxmax()

However, I haven't found a good way to return the row/column index of the max of the whole DataFrame.
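
For illustration (using the same example frame as below), df.idxmax() gives the row label of each column's max, while df.idxmax(axis=1) gives the column label of each row's max:

>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]], columns=list('abc'), index=list('def'))
>>> df.idxmax()        # row label of each column's max
a    f
b    e
c    f
dtype: object
>>> df.idxmax(axis=1)  # column label of each row's max
d    c
e    b
f    c
dtype: object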

For example, I can do this in numpy:

>>> npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>> np.where(npa == np.amax(npa))
(array([1]), array([1]))

But when I try something similar in pandas:

>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> df.where(df == df.max().max())
    a   b   c
d NaN NaN NaN
e NaN   9 NaN
f NaN NaN NaN

At a second level, what I actually want to do is to return the row and column labels of the top n values, e.g. as a Series.

E.g. for the above I'd like a function which does:

>>> topn(df,3)
b    e
c    f
b    f
dtype: object
>>> type(topn(df,3))
pandas.core.series.Series

or even just

>>> topn(df,3)
(['b','c','b'],['e','f','f'])

a la numpy.where()

watsonic

3 Answers


I figured out the first part:

npa = df.as_matrix()
rows, cols = np.where(npa == np.amax(npa))
([df.columns[c] for c in cols], [df.index[r] for r in rows])

Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top value with NaN, grabbing its index as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately, as shown here, there is: argpartition, though we have to use flattened indexing.

def topn(df,n):
    npa = df.as_matrix()
    topn_ind = np.argpartition(npa,-n,None)[-n:]  # flattened indices of the top n, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # sort them in descending order of value
    cols,indx = np.unravel_index(topn_ind,npa.shape,'F')  # unflatten, using column-major ordering
    return ([df.columns[c] for c in cols],[df.index[i] for i in indx])

Trying this on the example:

>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])

As desired. Note that the sorting was not originally asked for, but it adds little overhead when n is small.
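
Note that the column-major unraveling above implicitly assumes a square frame, since the flat indices come from C-order flattening. A minimal sketch of a shape-agnostic variant (a hypothetical helper, here called topn_any_shape, assuming a pandas version with df.to_numpy(); df.values behaves the same on older versions) unravels in C order and maps rows and columns explicitly:

def topn_any_shape(df, n):
    # sketch: same argpartition idea, but unravel in C order so that
    # non-square frames are handled as well
    npa = df.to_numpy()                                         # df.values on older pandas
    flat_ind = np.argpartition(npa, -n, axis=None)[-n:]         # flat indices of the top n, unsorted
    flat_ind = flat_ind[np.argsort(npa.flat[flat_ind])][::-1]   # sort descending by value
    rows, cols = np.unravel_index(flat_ind, npa.shape)          # default C-order unraveling
    return ([df.columns[c] for c in cols], [df.index[r] for r in rows])

On the example above this returns the same (['b', 'c', 'b'], ['e', 'f', 'f']), and it should also cope with non-square frames such as pd.DataFrame(np.random.randn(40,5)).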

watsonic
  • this seems not to work for arbitrary DataFrames, `topn(pd.DataFrame(np.random.randn(40,5)),4)` gives me an `IndexError: index out of bounds` error – greole Nov 12 '14 at 12:41

What you want to use is stack:

df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df.sort(ascending=False)
df.head(4)

e  b    9
f  c    8
   b    7
   a    6
dtype: int64
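
(On newer pandas, where Series.sort has been removed, df.sort_values(ascending=False) is the drop-in replacement here.) If only the position of the single maximum is wanted, the same stacked Series gives it in one line; a small sketch:

>>> df.stack().idxmax()
('e', 'b')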
acushner

I guess a DataFrame might not be the best choice for what you are trying to do, since the idea of the columns in a DataFrame is to hold independent data.

def topn(df,n):
    # pull the data out of the DataFrame
    # and flatten it to an array (column-major, i.e. column by column)
    vals = df.values.flatten(order='F')
    # argsort gives the positions that would sort vals in ascending order
    p = np.argsort(vals)
    # create two arrays with the column names and index labels
    # in the same (column-major) order as vals
    cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
    idxs = np.array([list(df.index) for _ in df.columns]).flatten()
    # reorder by p and return the labels of the top n values, largest first
    return cols[p][:-(n+1):-1], idxs[p][:-(n+1):-1]

>>> topn(df,3)
(array(['b', 'c', 'b'], 
      dtype='|S1'),
 array(['e', 'f', 'f'], 
      dtype='|S1'))
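
If the Series form sketched in the question is wanted, the two returned arrays can be wrapped up directly; a small sketch, reusing the cols/idxs pair returned above:

>>> cols, idxs = topn(df,3)
>>> pd.Series(idxs, index=cols)
b    e
c    f
b    f
dtype: object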


>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop

watsonic's solution takes slightly less time:

%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop

but both are much faster than the stack-based approach:

def topStack(df,n):
    df = df.stack()
    df.sort(ascending=False)
    return df.head(n)

%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop
greole
  • Sorry, but this is wrong in several ways and doesn't address my question. I added some explicit examples to make it clearer. Also, FYI, `df.describe().loc['max']` is much more easily done as `df.max()` – watsonic Nov 12 '14 at 08:18
  • Hey @greole, cool, a completely different way! For the community's benefit, can you explain a bit about your code? Some lines are somewhat cryptic, e.g. `*[[col]*len(df) for col in df.columns])` and `MultiIndex.from_tuples(zip(cols,val.index))` -- maybe take it piece by piece or supplement with comments? Also, if you package this as a function, we can compare runtimes... thanks for the different approach! – watsonic Nov 12 '14 at 11:44
  • @watsonic OK, I refactored it a bit and added some comments. Avoiding operating on the DataFrame itself gave a huge speed benefit. – greole Nov 13 '14 at 08:26