Create new column based on max value of groupby pandas

Time:10-12

I'm trying to create a new column based on a groupby function, but I'm running into an error. In the sample dataframe below, I want to create a new column that holds a new integer only in the rows that correspond to the max seq value per user. So, for instance, user122 would only have a number in the 3rd row, where seq is 3 (this user's highest seq number).

import random

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'user': 
  {0: 'user122',
  1: 'user122',
  2: 'user122',
  3: 'user124',
  4: 'user125',
  5: 'user125',
  6: 'user126',
  7: 'user126',
  8: 'user126'},
  'baseline': 
  {0: 4.0,
  1: 4.0,
  2: 4.0,
  3: 2,
  4: 4,
  5: 4,
  6: 5,
  7: 5,
  8: 5},
 'score': 
 {0: np.nan,
  1: 3,
  2: 2,
  3: 5,
  4: np.nan,
  5: 6,
  6: 3,
  7: 2,
  8: 1},
  'binary': 
  {0: 1,
  1: 1,
  2: 0,
  3: 0,
  4: 0,
  5: 0,
  6: 1,
  7: 0,
  8: 1},
  'var1': 
  {0: 3,
  1: 5,
  2: 5,
  3: 1,
  4: 1,
  5: 1,
  6: 1,
  7: 3,
  8: 5},
  'seq': 
  {0: 1,
  1: 2,
  2: 3,
  3: 1,
  4: 1,
  5: 2,
  6: 1,
  7: 2,
  8: 3},
})

The code I used is below:

df['newnum'] = np.where(df.groupby('user')['seq'].max(), random.randint(4, 9), 'NA')

The new column and the existing columns have different lengths, so I run into the error below. I thought that np.where would put "NA" in every row that did not hold the max seq value, but this didn't happen.

Length of values (4) does not match length of index (9)

Anyone else have a better idea?

And, if possible, I'd ideally like for the newnum variable to be a multiple of the baseline (but that was too complicated, so I just created a random digit).

Thanks for any help!

CodePudding user response:

The groupby aggregation returns one row per user (4 rows), which does not match the 9 rows of your dataframe 1:1, hence the error.
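To make the length mismatch concrete, here is a minimal sketch that compares an aggregating max() with transform('max'). It assumes a cut-down version of the question's dataframe with only the user and seq columns; the variable names agg and per_row are just for illustration.

```python
import pandas as pd

# Cut-down version of the question's data (only the columns needed here).
df = pd.DataFrame({
    "user": ["user122"] * 3 + ["user124"] + ["user125"] * 2 + ["user126"] * 3,
    "seq": [1, 2, 3, 1, 1, 2, 1, 2, 3],
})

# Aggregation: one value per user, indexed by user.
agg = df.groupby("user")["seq"].max()

# Transform: the group max repeated on every original row, same index as df.
per_row = df.groupby("user")["seq"].transform("max")

print(len(agg), len(per_row))  # 4 9
```

Only the transformed result has the same index as df, so only it can be compared row by row with df['seq'].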

Here is how you can accomplish it:

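# Use transform with the groupby to return each user's max seq against
# every row of that user, then compare it with the row's own seq.
# Note: np.random.randint(4, 9) is evaluated once, so every max-seq row
# gets the same number (the upper bound 9 is exclusive).

df['newnum'] = np.where(df.groupby('user')['seq'].transform('max').eq(df['seq']),
                        np.random.randint(4, 9),
                        np.nan)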
df
    user    baseline    score   binary  var1    seq     newnum
0   user122     4.0     NaN          1     3      1     NaN
1   user122     4.0     3.0          1     5      2     NaN
2   user122     4.0     2.0          0     5      3     6.0
3   user124     2.0     5.0          0     1      1     6.0
4   user125     4.0     NaN          0     1      1     NaN
5   user125     4.0     6.0          0     1      2     6.0
6   user126     5.0     3.0          1     1      1     NaN
7   user126     5.0     2.0          0     3      2     NaN
8   user126     5.0     1.0          1     5      3     6.0
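The question also mentions wanting newnum to be a multiple of baseline. One possible sketch, not part of the original answer: build the same boolean mask with transform, multiply baseline by a per-row random integer, and keep the result only on the max-seq rows. The multiplier range (2 to 4), the fixed seed, and the names is_last and multiplier are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Cut-down version of the question's data (only the columns needed here).
df = pd.DataFrame({
    "user": ["user122"] * 3 + ["user124"] + ["user125"] * 2 + ["user126"] * 3,
    "baseline": [4.0, 4.0, 4.0, 2, 4, 4, 5, 5, 5],
    "seq": [1, 2, 3, 1, 1, 2, 1, 2, 3],
})

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# True on the rows where seq equals that user's maximum seq.
is_last = df["seq"].eq(df.groupby("user")["seq"].transform("max"))

# Hypothetical multiplier: one integer per row drawn from 2, 3, 4
# (the upper bound 5 is exclusive).
multiplier = rng.integers(2, 5, size=len(df))

# Series.where keeps values where the mask is True and puts NaN elsewhere.
df["newnum"] = (df["baseline"] * multiplier).where(is_last)
print(df)
```

Passing size=len(df) draws a separate number for each row, unlike a single scalar call, which would give every max-seq row the same value.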

CodePudding user response:

idxmax = df.groupby('user')['seq'].idxmax()
df.loc[idxmax, 'newnum'] = ...

Notes:

  • In the first line of the above code, we get the index labels of df at which seq reaches its maximum for each user.
  • In the second line, we create a new column newnum and, at the same time, assign values to it at the idxmax positions. All other rows are NaN by default.
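As a runnable sketch of the idxmax approach, the version below fills in the right-hand side with a purely hypothetical value, twice the baseline of each selected row. The answer leaves that value open, so treat the factor 2 as an assumption, along with the cut-down dataframe.

```python
import pandas as pd

# Cut-down version of the question's data (only the columns needed here).
df = pd.DataFrame({
    "user": ["user122"] * 3 + ["user124"] + ["user125"] * 2 + ["user126"] * 3,
    "baseline": [4.0, 4.0, 4.0, 2, 4, 4, 5, 5, 5],
    "seq": [1, 2, 3, 1, 1, 2, 1, 2, 3],
})

# Index label of the row holding each user's maximum seq.
idxmax = df.groupby("user")["seq"].idxmax()

# Hypothetical assignment: twice the baseline on those rows only.
# The right-hand side is aligned on the index; other rows become NaN.
df.loc[idxmax, "newnum"] = df.loc[idxmax, "baseline"] * 2
print(df)
```

Note that idxmax returns only the first row when a user has several rows tied at the maximum seq, whereas the transform('max') comparison marks all of them.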