3

I am new to Python and have been using Excel. I am wanting to bring in IF statements from Excel. I have so far used xrld to import my sheets (yay!).

I know that for excel, we have:

IF(logical_test, [value_if_true], [value_if_false])

For Python, I want to show that if a value, A < 0, then square it and if value A > 0 (false), then cube it.

For Excel, this is:

IF(A<0, [A^2], [A^3])

Also, can this be done for values in multiple columns in Python? For Excel, I have to do each column separately, but just drag the cursor with the master equation down.

GeoRX
  • 33
  • 1
  • 4
  • 1
    Possible duplicate of [Does Python have a ternary conditional operator?](https://stackoverflow.com/questions/394809/does-python-have-a-ternary-conditional-operator) – Davis Herring Dec 07 '18 at 01:40
  • To be clear, you're looking to parse and execute an `IF` statement already defined in your Excel sheet? Or, rather, are you looking to translate that `IF` statement into something more native to Python entirely? – blacksite Dec 07 '18 at 01:41
  • Not sure what “columns in Python” are. Python code can be reused without being repeated in different places. – Davis Herring Dec 07 '18 at 01:41
  • I am looking to translate this IF statement that is more native to Python in a line of code. I am not worried about translating it from Excel sheet to Python as this is a simple if statement. I rather do it manually if that makes sense? – GeoRX Dec 07 '18 at 01:46

1 Answers1

1

Here is (not exhaustive) list of ways to do this in Python. Pandas, if you're familiar with it, is a data manipulation and analysis library. NumPy is a mathematical computing framework that pandas heavily uses.

A pandas.DataFrame is very similar, conceptually, to an Excel sheet. It's a very flexible collection/data structure. Let's construct a DataFrame with just one column:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': np.random.randn(10)})
>>> df
          a
0  0.829202
1 -2.722945
2  0.877674
3  0.981782
4  0.398093
5 -0.593295
6  0.871504
7  0.439916
8  1.026697
9 -0.063484

I've used np.random.randn to generate a sequence of random numbers.

Note that if you don't have access to third-party libraries, you can also store these data in a list or other built-in collection. Pandas is kind of the best tool, IMO, for this work, however.

NumPy example

Let's take a look at a NumPy solution. We can use numpy.where (abbreviated np.where) to operate in basically an identical fashion to the Excel IF you're used to:

>>> calc = np.where(df['a'] < 0, df['a'] ** 2, df['a'] ** 3)
>>> calc
array([5.70138715e-01, 7.41442961e+00, 6.76083380e-01, 9.46335721e-01,
       6.30887662e-02, 3.51999052e-01, 6.61924036e-01, 8.51350238e-02,
       1.08224773e+00, 4.03018586e-03])

Now we can assign calc, a NumPy array, back to the DataFrame df if we like:

>>> df['calc'] = calc
>>> df
          a      calc
0  0.829202  0.570139
1 -2.722945  7.414430
2  0.877674  0.676083
3  0.981782  0.946336
4  0.398093  0.063089
5 -0.593295  0.351999
6  0.871504  0.661924
7  0.439916  0.085135
8  1.026697  1.082248
9 -0.063484  0.004030

Pure-Python example:

>>> calc = [x ** 2 if x < 0 else x ** 3 for x in df['a']]
>>> calc
[0.5701387154526354, 7.414429614316651, 0.6760833798456796, 0.9463357209355919, 0.0630887661523053, 0.3519990516487755, 0.6619240363210077, 0.08513502375096661, 1.0822477332767624, 0.004030185858203847]
>>> df['calc'] = calc
>>> df
          a      calc
0  0.829202  0.570139
1 -2.722945  7.414430
2  0.877674  0.676083
3  0.981782  0.946336
4  0.398093  0.063089
5 -0.593295  0.351999
6  0.871504  0.661924
7  0.439916  0.085135
8  1.026697  1.082248
9 -0.063484  0.004030

This pure-Python example uses what's commonly referred to as a "ternary" operator - that's the x ** 2 if x < 0 else x ** 3 stuff. That line is just a more concise, "Pythonic" way of doing this:

>>> calc = []
>>> for x in df['a']:
...     if x < 0:
...             calc.append(x ** 2)
...     else:
...             calc.append(x ** 3)
... 
>>> calc
[0.5701387154526354, 7.414429614316651, 0.6760833798456796, 0.9463357209355919, 0.0630887661523053, 0.3519990516487755, 0.6619240363210077, 0.08513502375096661, 1.0822477332767624, 0.004030185858203847]

You'll often see the former approach - a "list comprehension" - because it's more concise than the latter approach. They are equivalent, from an output perspective, however.

We can also test these approaches out from a timing perspective:

>>> import time
>>> N = 100000
>>> 
>>> def test_numpy():
...     t0 = time.time()
...     for _ in range(N):
...         calc = np.where(df['a'] < 0, df['a'] ** 2, df['a'] ** 3)
...     t1 = time.time()
...     return t1 - t0
... 
>>> def test_list():
...     t0 = time.time()
...     for _ in range(N):
...         calc = [x ** 2 if x < 0 else x ** 3 for x in df['a']]
...     t1 = time.time()
...     return t1 - t0
... 
>>> test_numpy()
34.89216589927673
>>> test_list()
1.9015109539031982

It seems the list approach is quite a bit quicker. Looking a bit more deeply using the dis library, we can see the np.where approach has a bit more overhead in terms of what operations are executed:

>>> import dis
>>> dis.dis("np.where(df['a'] < 0, df['a'] ** 2, df['a'] ** 3)")
  1           0 LOAD_NAME                0 (np)
              2 LOAD_ATTR                1 (where)
              4 LOAD_NAME                2 (df)
              6 LOAD_CONST               0 ('a')
              8 BINARY_SUBSCR
             10 LOAD_CONST               1 (0)
             12 COMPARE_OP               0 (<)
             14 LOAD_NAME                2 (df)
             16 LOAD_CONST               0 ('a')
             18 BINARY_SUBSCR
             20 LOAD_CONST               2 (2)
             22 BINARY_POWER
             24 LOAD_NAME                2 (df)
             26 LOAD_CONST               0 ('a')
             28 BINARY_SUBSCR
             30 LOAD_CONST               3 (3)
             32 BINARY_POWER
             34 CALL_FUNCTION            3
             36 RETURN_VALUE
>>> dis.dis("[x ** 2 if x < 0 else x ** 3 for x in df['a']]")
  1           0 LOAD_CONST               0 (<code object <listcomp> at 0x10fdcd1e0, file "<dis>", line 1>)
              2 LOAD_CONST               1 ('<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_NAME                0 (df)
              8 LOAD_CONST               2 ('a')
             10 BINARY_SUBSCR
             12 GET_ITER
             14 CALL_FUNCTION            1
             16 RETURN_VALUE
blacksite
  • 10,028
  • 6
  • 44
  • 94
  • 1
    Thanks! I really appreciate the pure-python method and see how it is broken down. I am not familiar with Pandas but I have heard positive things about it. The numpy way really does look just like Excel! – GeoRX Dec 07 '18 at 02:00