70

I've been exploring how to optimize my code and ran across pandas .at method. Per the documentation

Fast label-based scalar accessor

Similarly to loc, at provides label based scalar lookups. You can also set using these indexers.

So I ran some samples:

Setup

import pandas as pd
import numpy as np
from string import letters, lowercase, uppercase

lt = list(letters)
lc = list(lowercase)
uc = list(uppercase)

def gdf(rows, cols, seed=None):
    """rows and cols are what you'd pass
    to pd.MultiIndex.from_product()"""
    gmi = pd.MultiIndex.from_product
    df = pd.DataFrame(index=gmi(rows), columns=gmi(cols))
    np.random.seed(seed)
    df.iloc[:, :] = np.random.rand(*df.shape)
    return df

seed = [3, 1415]
df = gdf([lc, uc], [lc, uc], seed)

print df.head().T.head().T

df looks like:

            a                                        
            A         B         C         D         E
a A  0.444939  0.407554  0.460148  0.465239  0.462691
  B  0.032746  0.485650  0.503892  0.351520  0.061569
  C  0.777350  0.047677  0.250667  0.602878  0.570528
  D  0.927783  0.653868  0.381103  0.959544  0.033253
  E  0.191985  0.304597  0.195106  0.370921  0.631576

Lets use .at and .loc and ensure I get the same thing

print "using .loc", df.loc[('a', 'A'), ('c', 'C')]
print "using .at ", df.at[('a', 'A'), ('c', 'C')]

using .loc 0.37374090276
using .at  0.37374090276

Test speed using .loc

%%timeit
df.loc[('a', 'A'), ('c', 'C')]

10000 loops, best of 3: 180 µs per loop

Test speed using .at

%%timeit
df.at[('a', 'A'), ('c', 'C')]

The slowest run took 6.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8 µs per loop

This looks to be a huge speed increase. Even at the caching stage 6.11 * 8 is a lot faster than 180

Question

What are the limitations of .at? I'm motivated to use it. The documentation says it's similar to .loc but it doesn't behave similarly. Example:

# small df
sdf = gdf([lc[:2]], [uc[:2]], seed)

print sdf.loc[:, :]

          A         B
a  0.444939  0.407554
b  0.460148  0.465239

where as print sdf.at[:, :] results in TypeError: unhashable type

So obviously not the same even if the intent is to be similar.

That said, who can provide guidance on what can and cannot be done with the .at method?

Community
  • 1
  • 1
piRSquared
  • 240,659
  • 38
  • 359
  • 510
  • FYI: python3 version of letters, lowercase and uppercase constant are prepended with ascii_ https://docs.python.org/release/3.1.3/library/string.html#string-constants – thoroc Feb 28 '19 at 10:56

4 Answers4

64

Update: df.get_value is deprecated as of version 0.21.0. Using df.at or df.iat is the recommended method going forward.


df.at can only access a single value at a time.

df.loc can select multiple rows and/or columns.

Note that there is also df.get_value, which may be even quicker at accessing single values:

In [25]: %timeit df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 187 µs per loop

In [26]: %timeit df.at[('a', 'A'), ('c', 'C')]
100000 loops, best of 3: 8.33 µs per loop

In [35]: %timeit df.get_value(('a', 'A'), ('c', 'C'))
100000 loops, best of 3: 3.62 µs per loop

Under the hood, df.at[...] calls df.get_value, but it also does some type checking on the keys.

unutbu
  • 711,858
  • 148
  • 1,594
  • 1,547
  • 5
    Please note that `get_value` is deprecated since version 0.21.0. – Cleb Dec 30 '17 at 12:00
  • does this deprecation imply that we should use .loc instead of .at ? – Vicrobot Jun 22 '19 at 18:26
  • 1
    No, it doesn’t imply that. at is faster than loc but only works on individual cells. loc can work with arrays of cells – piRSquared Jun 22 '19 at 18:47
  • 1
    You may want to test this again. There is not much difference anymore. %timeit df.at[100, ['A']] = 10 661 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) %timeit df.loc[100, ['A']] = 10 645 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) – Tarik Mar 24 '20 at 06:06
  • 1
    I'm finding that df.at does indeed access multiple rows if there are duplicated index values. – Tristan Brown Jun 04 '20 at 17:23
41

As you asked about the limitations of .at, here is one thing I recently ran into (using pandas 0.22). Let's use the example from the documentation:

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
df2 = df.copy()

    A   B   C
4   0   2   3
5   0   4   1
6  10  20  30

If I now do

df.at[4, 'B'] = 100

the result looks as expected

    A    B   C
4   0  100   3
5   0    4   1
6  10   20  30

However, when I try to do

 df.at[4, 'C'] = 10.05

it seems that .at tries to conserve the datatype (here: int):

    A    B   C
4   0  100  10
5   0    4   1
6  10   20  30

That seems to be a difference to .loc:

df2.loc[4, 'C'] = 10.05

yields the desired

    A   B      C
4   0   2  10.05
5   0   4   1.00
6  10  20  30.00

The risky thing in the example above is that it happens silently (the conversion from float to int). When one tries the same with strings it will throw an error:

df.at[5, 'A'] = 'a_string'

ValueError: invalid literal for int() with base 10: 'a_string'

It will work, however, if one uses a string on which int() actually works as noted by @n1k31t4 in the comments, e.g.

df.at[5, 'A'] = '123'

     A   B   C
4    0   2   3
5  123   4   1
6   10  20  30
Cleb
  • 20,118
  • 16
  • 91
  • 131
  • 3
    Good point! I've been debating where I should bring this up. – piRSquared Jun 08 '18 at 12:26
  • 2
    @piRSquared: Yes, would be good if this was documented anywhere (might have overlooked it though and I should try on 0.23). – Cleb Jun 08 '18 at 12:27
  • 1
    Happens on 0.23 as well. Try `df = pd.DataFrame([[1, 2], [3, 4]]), df.at[0, 1] = 'a'` – piRSquared Jun 08 '18 at 12:29
  • 2
    Thank you for this, it definitely does try to preserve the numeric datatype. I was trying `df.at[1, 'test'] = 'string'` and it wasn't working. `.loc` fixed the problem – Amon Aug 22 '18 at 18:25
  • 1
    So if you pass it anything that `int()` is able to consume, it will work... meaning you could do: `df.at[5, "A"] = "123"` and it would assign the string, converting it to int along the way. – n1k31t4 Jan 30 '19 at 21:34
  • @n1k31t4: Thanks for the comment and sorry for the very late reply. You are correct; I added your example to the answer. – Cleb Jul 03 '19 at 20:07
4

Adding to the above, Pandas documentation for the at function states:

Access a single value for a row/column label pair.

Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series.

For setting data loc and at are similar, for example:

df = pd.DataFrame({'A': [1,2,3], 'B': [11,22,33]}, index=[0,0,1])

Both loc and at will produce the same result

df.at[0, 'A'] = [101,102]
df.loc[0, 'A'] = [101,102]

    A   B
0   101 11
0   102 22
1   3   33

df.at[0, 'A'] = 103
df.loc[0, 'A'] = 103

    A   B
0   103 11
0   103 22
1   3   33

Also, for accessing a single value, both are the same

df.loc[1, 'A']   # returns a single value (<class 'numpy.int64'>)
df.at[1, 'A']    # returns a single value (<class 'numpy.int64'>)

3

However, when matching multiple values, loc will return a group of rows/cols from the DataFrame while at will return an array of values

df.loc[0, 'A']  # returns a Series (<class 'pandas.core.series.Series'>)

0    103
0    103
Name: A, dtype: int64

df.at[0, 'A']   # returns array of values (<class 'numpy.ndarray'>)

array([103, 103])

And more so, loc can be used to match a group of row/cols and can be given only an index, while at must receive the column

df.loc[0]  # returns a DataFrame view (<class 'pandas.core.frame.DataFrame'>)

    A   B
0   103 11
0   103 22


# df.at[0]  # ERROR: must receive column
emem
  • 4,897
  • 1
  • 19
  • 27
0

.at is an optimized data access method compared to .loc .

.loc of a data frame selects all the elements located by indexed_rows and labeled_columns as given in its argument. Insetad, .at selects particular elemnt of a data frame positioned at the given indexed_row and labeled_column.

Also, .at takes one row and one column as input argument, whereas .loc may take multiple rows and columns. Oputput using .at is a single element and using .loc maybe a Series or a DataFrame.

Vikranth Inti
  • 75
  • 4
  • 13