Here is (not exhaustive) list of ways to do this in Python. Pandas, if you're familiar with it, is a data manipulation and analysis library. NumPy is a mathematical computing framework that pandas heavily uses.
A pandas.DataFrame
is very similar, conceptually, to an Excel sheet. It's a very flexible collection/data structure. Let's construct a DataFrame with just one column:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.random.randn(10)})
>>> df
a
0 0.829202
1 -2.722945
2 0.877674
3 0.981782
4 0.398093
5 -0.593295
6 0.871504
7 0.439916
8 1.026697
9 -0.063484
I've used np.random.randn
to generate a sequence of random numbers.
Note that if you don't have access to third-party libraries, you can also store these data in a list or other built-in collection. Pandas is kind of the best tool, IMO, for this work, however.
NumPy example
Let's take a look at a NumPy solution. We can use numpy.where
(abbreviated np.where
) to operate in basically an identical fashion to the Excel IF
you're used to:
>>> calc = np.where(df['a'] < 0, df['a'] ** 2, df['a'] ** 3)
>>> calc
array([5.70138715e-01, 7.41442961e+00, 6.76083380e-01, 9.46335721e-01,
6.30887662e-02, 3.51999052e-01, 6.61924036e-01, 8.51350238e-02,
1.08224773e+00, 4.03018586e-03])
Now we can assign calc
, a NumPy array, back to the DataFrame df
if we like:
>>> df['calc'] = calc
>>> df
a calc
0 0.829202 0.570139
1 -2.722945 7.414430
2 0.877674 0.676083
3 0.981782 0.946336
4 0.398093 0.063089
5 -0.593295 0.351999
6 0.871504 0.661924
7 0.439916 0.085135
8 1.026697 1.082248
9 -0.063484 0.004030
Pure-Python example:
>>> calc = [x ** 2 if x < 0 else x ** 3 for x in df['a']]
>>> calc
[0.5701387154526354, 7.414429614316651, 0.6760833798456796, 0.9463357209355919, 0.0630887661523053, 0.3519990516487755, 0.6619240363210077, 0.08513502375096661, 1.0822477332767624, 0.004030185858203847]
>>> df['calc'] = calc
>>> df
a calc
0 0.829202 0.570139
1 -2.722945 7.414430
2 0.877674 0.676083
3 0.981782 0.946336
4 0.398093 0.063089
5 -0.593295 0.351999
6 0.871504 0.661924
7 0.439916 0.085135
8 1.026697 1.082248
9 -0.063484 0.004030
This pure-Python example uses what's commonly referred to as a "ternary" operator - that's the x ** 2 if x < 0 else x ** 3
stuff. That line is just a more concise, "Pythonic" way of doing this:
>>> calc = []
>>> for x in df['a']:
... if x < 0:
... calc.append(x ** 2)
... else:
... calc.append(x ** 3)
...
>>> calc
[0.5701387154526354, 7.414429614316651, 0.6760833798456796, 0.9463357209355919, 0.0630887661523053, 0.3519990516487755, 0.6619240363210077, 0.08513502375096661, 1.0822477332767624, 0.004030185858203847]
You'll often see the former approach - a "list comprehension" - because it's more concise than the latter approach. They are equivalent, from an output perspective, however.
We can also test these approaches out from a timing perspective:
>>> import time
>>> N = 100000
>>>
>>> def test_numpy():
... t0 = time.time()
... for _ in range(N):
... calc = np.where(df['a'] < 0, df['a'] ** 2, df['a'] ** 3)
... t1 = time.time()
... return t1 - t0
...
>>> def test_list():
... t0 = time.time()
... for _ in range(N):
... calc = [x ** 2 if x < 0 else x ** 3 for x in df['a']]
... t1 = time.time()
... return t1 - t0
...
>>> test_numpy()
34.89216589927673
>>> test_list()
1.9015109539031982
It seems the list
approach is quite a bit quicker. Looking a bit more deeply using the dis
library, we can see the np.where
approach has a bit more overhead in terms of what operations are executed:
>>> import dis
>>> dis.dis("np.where(df['a'] < 0, df['a'] ** 2, df['a'] ** 3)")
1 0 LOAD_NAME 0 (np)
2 LOAD_ATTR 1 (where)
4 LOAD_NAME 2 (df)
6 LOAD_CONST 0 ('a')
8 BINARY_SUBSCR
10 LOAD_CONST 1 (0)
12 COMPARE_OP 0 (<)
14 LOAD_NAME 2 (df)
16 LOAD_CONST 0 ('a')
18 BINARY_SUBSCR
20 LOAD_CONST 2 (2)
22 BINARY_POWER
24 LOAD_NAME 2 (df)
26 LOAD_CONST 0 ('a')
28 BINARY_SUBSCR
30 LOAD_CONST 3 (3)
32 BINARY_POWER
34 CALL_FUNCTION 3
36 RETURN_VALUE
>>> dis.dis("[x ** 2 if x < 0 else x ** 3 for x in df['a']]")
1 0 LOAD_CONST 0 (<code object <listcomp> at 0x10fdcd1e0, file "<dis>", line 1>)
2 LOAD_CONST 1 ('<listcomp>')
4 MAKE_FUNCTION 0
6 LOAD_NAME 0 (df)
8 LOAD_CONST 2 ('a')
10 BINARY_SUBSCR
12 GET_ITER
14 CALL_FUNCTION 1
16 RETURN_VALUE