
I am a beginner. I've looked all over and read a bunch of related questions but can't quite figure this out. I know I am the problem and that I'm missing something, but I'm hoping someone will be kind and help me out. I am attempting to convert data from one video game (a college basketball simulation) into data consistent with another video game's (pro basketball simulation) format.

I have a DF that has columns: Name, Pos, Height, Weight, Shot, Points

With values such as:

Jon Smith, C, 84, 235, Exc, 19.4
Greg Jones, PG, 72, 187, Fair, 12.0

I want to create a new column for "InsideScoring". What I'd like to do is assign each player a randomly generated number within a certain range, based on the position they played, their height, weight, shot rating and points scored.

I tried a bunch of attempts like:

df1['InsideScoring'] = 0
df1.loc[(df1.Pos == "C") &
        (df1.Height > 82) &
        (df1.Points > 19.0) &
        (df1.Weight > 229), 'InsideScoring'] = np.random.randint(85,100)

When I do this, every player who meets the criteria gets the same value in "InsideScoring" (a single number between 85 and 100) rather than a random mix of numbers between 85 and 100.

Eventually, what I want to do is go through the list of players and based on those four criteria, assign values from different ranges. Any ideas appreciated.

Pandas: Create a new column with random values based on conditional

Numpy "where" with multiple conditions

mgadfly

1 Answer


My recommendation would be to use np.select here. You set up your conditions and your outputs, and you're good to go. To avoid iteration, and also to avoid assigning the same random value to every row that meets a condition, create an array of random values as long as your DataFrame for each range, and select from those:
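As a minimal sketch of how np.select chooses values element by element, consider the toy arrays below. The arrays and their values are made up for illustration and are not part of the question's data. For each position, the first condition that is True decides which choice array supplies the value; positions where no condition holds get the default, which is 0.

```python
import numpy as np

# Two boolean conditions, evaluated per position (illustrative values only).
conditions = [
    np.array([True, False, False, True]),
    np.array([True, True, False, False]),
]

# One candidate array per condition, same length as the conditions.
choices = [
    np.array([10, 11, 12, 13]),
    np.array([20, 21, 22, 23]),
]

# Position 0: both conditions hold, the first one wins -> 10
# Position 1: only the second holds -> 21
# Position 2: neither holds -> default of 0
# Position 3: only the first holds -> 13
result = np.select(conditions, choices)
print(result.tolist())  # [10, 21, 0, 13]
```

Because each choice is a full-length array, each row reads its own entry, which is what keeps the random values from being repeated.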


Setup

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235], 
    'Shot': ['Amazing', 'Fair'],
    'Points': [999, 25]
})

    Name  Height Pos  Weight     Shot  Points
0  Chris      72  PG     165  Amazing     999
1   John      84   C     235     Fair      25

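Now set up your ranges and your conditions (create as many of these as you like). Each range is an array with one random value per row, and np.select picks the entry from the range whose condition that row matches: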

cond1 = df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200)
cond2 = df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200)

range1 = np.random.randint(85, 100, len(df))
range2 = np.random.randint(50, 85, len(df))

df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))

    Name  Height Pos  Weight     Shot  Points  InsideScoring
0  Chris      72  PG     165  Amazing     999             72
1   John      84   C     235     Fair      25             89
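One thing to watch for: a row that matches none of the conditions gets np.select's default value, which is 0. The sketch below shows one possible way to handle that, by adding a final catch-all condition with its own range. The third player and the 25 to 49 fallback range are assumptions made up for this example, not part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Chris', 'John', 'Pat'],   # 'Pat' is a made-up extra row
    'Height': [72, 84, 78],
    'Pos': ['PG', 'C', 'SF'],
    'Weight': [165, 235, 210],
})

cond1 = df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200)
cond2 = df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200)
# Catch-all: True for every row that matched neither condition above.
cond_other = ~(cond1 | cond2)

range1 = np.random.randint(85, 100, len(df))
range2 = np.random.randint(50, 85, len(df))
# Hypothetical fallback range, chosen only for illustration.
range_other = np.random.randint(25, 50, len(df))

out = df.assign(
    InsideScoring=np.select(
        [cond1, cond2, cond_other],
        [range1, range2, range_other],
    )
)
print(out)
```

Since np.select takes the first matching condition, putting the catch-all last means it only applies to rows that the specific conditions did not claim.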

Now to verify that this doesn't assign the same value to every matching row:

df = pd.concat([df]*5)

... # Setup the ranges and conditions again

df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))

    Name  Height Pos  Weight     Shot  Points  InsideScoring
0  Chris      72  PG     165  Amazing     999             56
1   John      84   C     235     Fair      25             96
0  Chris      72  PG     165  Amazing     999             74
1   John      84   C     235     Fair      25             93
0  Chris      72  PG     165  Amazing     999             63
1   John      84   C     235     Fair      25             97
0  Chris      72  PG     165  Amazing     999             55
1   John      84   C     235     Fair      25             95
0  Chris      72  PG     165  Amazing     999             60
1   John      84   C     235     Fair      25             90

And we can see that different random values are assigned, even though every row matches one of only two conditions. This is less memory efficient than iterating and picking one random value per row, since we generate a full-length array for every range and discard most of it, but it will still be faster because these are vectorized operations.
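If the unused numbers are a concern, one alternative (a sketch, not what the answer above does) is to stay close to the .loc attempt from the question and generate exactly as many random values as there are matching rows. The key change is passing a size to np.random.randint so that it returns an array instead of a single scalar.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235],
})
# Repeat the rows so that several rows match the same condition.
df = pd.concat([df] * 5, ignore_index=True)

df['InsideScoring'] = 0
mask = df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200)

# mask.sum() counts the True entries, so one value is drawn per matching row.
df.loc[mask, 'InsideScoring'] = np.random.randint(85, 100, size=mask.sum())
print(df)
```

This needs one .loc assignment per condition, so with many conditions np.select may still be the more convenient choice.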

user3483203
  • I tested it on a sample DF and it worked, so I think this is what I needed to understand. Thanks. However, when I tested it on the actual dataframe it gave me either: ValueError: shape mismatch: objects cannot be broadcast to a single shape or ValueError: operands could not be broadcast together with shapes (80640,) (105,). If you have any idea how I messed that up let me know ... but thank you, your answer is what I needed to know. – mgadfly Sep 17 '18 at 22:38
  • Did you forget to re-run your `conds` and `ranges` setup? You have to re-run that each time or else the ranges will be the incorrect shape – user3483203 Sep 17 '18 at 22:47
  • Yes. It now works. My lack of reputation prevents me from +1-ing your answer, but I really appreciate the help. This was really stumping me. – mgadfly Sep 17 '18 at 23:02
  • Glad to help, happy programming! – user3483203 Sep 17 '18 at 23:15
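The shape mismatch discussed in the comments comes from conditions and ranges that were built for a DataFrame of a different length. One way to make that mistake harder, sketched here under the assumption that the thresholds from the answer are the ones you want, is to build both inside a function so they are always derived from the DataFrame being passed in. The function name add_inside_scoring is hypothetical.

```python
import numpy as np
import pandas as pd

def add_inside_scoring(df):
    """Hypothetical helper: rebuilds conditions and ranges for this df's length."""
    n = len(df)
    conds = [
        df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200),
        df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200),
    ]
    choices = [
        np.random.randint(85, 100, n),
        np.random.randint(50, 85, n),
    ]
    # Conditions and choices share the length n, so they always broadcast.
    return df.assign(InsideScoring=np.select(conds, choices))

small = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235],
})
big = pd.concat([small] * 5, ignore_index=True)

# The same function works on both sizes without re-running any setup by hand.
print(add_inside_scoring(small))
print(add_inside_scoring(big))
```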