Numpy "where" with multiple conditions

Question

I try to add a new column "energy_class" to a dataframe "df_energy" which it contains the string "high" if the "consumption_energy" value > 400, "medium" if the "consumption_energy" value is between 200 and 400, and "low" if the "consumption_energy" value is under 200. I try to use np.where from numpy, but I see that numpy.where(condition[, x, y]) treat only two condition not 3 like in my case.

Any idea to help me please?

Thank you in advance

Why just don't use `numpy.select()`? – banderlog013 Feb 01 '21 at 13:21 — banderlog013, Feb 01 '21 at 13:21

Merlin · Answer 1 · 2020-09-17T19:32:52.133

72

Try this: Using the setup from @Maxu

col         = 'consumption_energy'
conditions  = [ df2[col] >= 400, (df2[col] < 400) & (df2[col]> 200), df2[col] <= 200 ]
choices     = [ "high", 'medium', 'low' ]
    
df2["energy_class"] = np.select(conditions, choices, default=np.nan)


  consumption_energy energy_class
0                 459         high
1                 416         high
2                 186          low
3                 250       medium
4                 411         high
5                 210       medium
6                 343       medium
7                 328       medium
8                 208       medium
9                 223       medium

edited Sep 17 '20 at 19:32

answered Aug 23 '16 at 22:53

Merlin

19,645
34
108
190

9

This is fantastic. As a note, this works like most `if / elif / else` functions in that if the first _and_ the second conditions are met, the first will apply and not the second. Thanks @Merlin – elPastor Mar 07 '19 at 23:55
I am trying to use this setup but in my condition I am checking if the column contains certain string values and this is not working and I am getting the error (ValueError: The truth value of a Series is ambiguous.). Any ideas why? Here is my question: https://stackoverflow.com/questions/64806956/manipulating-dataset-feature-using-string-comparison-in-python – sums22 Nov 13 '20 at 11:12
The best answer! – Jazz Dec 17 '20 at 13:44

score 59 · Accepted Answer · edited Jun 07 '18 at 11:17

59

You can use a ternary:

np.where(consumption_energy > 400, 'high', 
         (np.where(consumption_energy < 200, 'low', 'medium')))

edited Jun 07 '18 at 11:17

Mel

4,929
10
33
39

answered Aug 23 '16 at 19:17

Alexander

87,529
23
162
169

1

I think this is great, except if you have complex multiple condition, then this can be a big issue. – xbeta Dec 08 '20 at 02:53
1

Agreed. For anything more complex, I would use the `pd.cut` solution by @Maxu. – Alexander Dec 08 '20 at 07:03

score 17 · Answer 3 · edited Jan 23 '20 at 06:31

17

I would use the cut() method here, which will generate very efficient and memory-saving category dtype:

In [124]: df
Out[124]:
   consumption_energy
0                 459
1                 416
2                 186
3                 250
4                 411
5                 210
6                 343
7                 328
8                 208
9                 223

In [125]: pd.cut(df.consumption_energy,
                 [0, 200, 400, np.inf],
                 labels=['low','medium','high']
          )
Out[125]:
0      high
1      high
2       low
3    medium
4      high
5    medium
6    medium
7    medium
8    medium
9    medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]

edited Jan 23 '20 at 06:31

yoonghm

2,558
18
33

answered Aug 23 '16 at 20:16

MaxU

173,524
24
290
329

Hey max, in this method is 200 included in medium or low? – Umar.H Nov 01 '18 at 06:47
@Datanovice, it should fall under “medium “ category because all right edges of intervals are included per default – MaxU Nov 01 '18 at 06:53
1

@Datanovice, actually I’m wrong - it should fall under the “low” category as it is a right edge of the first interval – MaxU Nov 01 '18 at 07:06

score 16 · Answer 4 · 2017-11-21T14:06:11.140

I like to keep the code clean. That's why I prefer np.vectorize for such tasks.

def conditions(x):
    if x > 400:
        return "High"
    elif x > 200:
        return "Medium"
    else:
        return "Low"

func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])

Then just add numpy array as a column in your dataframe using:

df_energy["energy_class"] = energy_class

The advantage in this approach is that if you wish to add more complicated constraints to a column, it can be done easily. Hope it helps.

wpmoradi · Answer 5 · 2018-10-11T20:51:56.960

I second using np.vectorize. It is much faster than np.where and also cleaner code wise. You can definitely tell the speed up with larger data sets. You can use a dictionary format for your conditionals as well as the output of those conditions.

# Vectorizing with numpy 
row_dic = {'Condition1':'high',
          'Condition2':'medium',
          'Condition3':'low',
          'Condition4':'lowest'}

def Conditions(dfSeries_element,dictionary):
    '''
    dfSeries_element is an element from df_series 
    dictionary: is the dictionary of your conditions with their outcome
    '''
    if dfSeries_element in dictionary.keys():
        return dictionary[dfSeries]

def VectorizeConditions():
    func = np.vectorize(Conditions)
    result_vector = func(df['Series'],row_dic)
    df['new_Series'] = result_vector

    # running the below function will apply multi conditional formatting to your df
VectorizeConditions()

Bhishan Poudel · Answer 6 · 2019-04-23T14:36:04.280

WARNING: Always be careful that if your data has missing values np.where may be tricky to use and may give you the wrong result inadvertently.

Consider this situation:

df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high', 
         (np.where(df.consumption_energy < 200, 'low', 'medium')))

# if we do not use this second line, then
#  if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan

Alternatively, you can use one-more nested np.where for medium versus nan which would be ugly.

IMHO best way to go is pd.cut. It deals with NaNs and easy to use.

Examples:

import numpy as np
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')

# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])


# manually add another line for nans
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age <20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat'] = np.nan

# multiple nested where
df['age_cat3'] = np.where(df.age > 60, 'old',
                         (np.where(df.age <20, 'child',
                                   np.where(df.age.isnull(), np.nan, 'medium'))))

# outptus
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
    age age_cat age_cat2 age_cat3
0  22.0  medium   medium   medium
1  38.0  medium   medium   medium
2  26.0  medium   medium   medium
3  35.0  medium   medium   medium
4  35.0  medium   medium   medium
5   NaN     NaN   medium      nan
6  54.0  medium   medium   medium

blanck · Answer 7 · 2021-04-29T16:47:40.703

-1

myassign["assign3"]=np.where(myassign["points"]>90,"genius",(np.where((myassign["points"]>50) & (myassign["points"]<90),"good","bad"))

when you wanna use only "where" method but with multiple condition. we can add more condition by adding more (np.where) by the same method like we did above. and again the last two will be one you want.

edited Apr 29 '21 at 16:47

answered Apr 27 '21 at 17:54

blanck

1
2

Numpy "where" with multiple conditions

7 Answers7

Examples:

Linked

Related