I'm trying to create a new column based on a groupby function, but I'm running into an error. In the sample dataframe below, I want to create a new column that holds a new integer only in the rows that correspond to the maximum seq
value per user. So, for instance, user122 would only have a number in the 3rd row, where seq is 3 (this user's highest seq number).
import numpy as np
import pandas as pd

df = pd.DataFrame({
'user':
{0: 'user122',
1: 'user122',
2: 'user122',
3: 'user124',
4: 'user125',
5: 'user125',
6: 'user126',
7: 'user126',
8: 'user126'},
'baseline':
{0: 4.0,
1: 4.0,
2: 4.0,
3: 2,
4: 4,
5: 4,
6: 5,
7: 5,
8: 5},
'score':
{0: np.nan,
1: 3,
2: 2,
3: 5,
4: np.nan,
5: 6,
6: 3,
7: 2,
8: 1},
'binary':
{0: 1,
1: 1,
2: 0,
3: 0,
4: 0,
5: 0,
6: 1,
7: 0,
8: 1},
'var1':
{0: 3,
1: 5,
2: 5,
3: 1,
4: 1,
5: 1,
6: 1,
7: 3,
8: 5},
'seq':
{0: 1,
1: 2,
2: 3,
3: 1,
4: 1,
5: 2,
6: 1,
7: 2,
8: 3},
})
The code I used is below:
df['newnum'] = np.where(df.groupby('user')['seq'].max(), random.randint(4, 9), 'NA')
The new column and the existing columns do not have the same length, so I run into the error below. I thought that if I specified multiple conditions in np.where,
it would put "NA" in all of the rows that do not hold the max seq value, but this didn't happen.
Length of values (4) does not match length of index (9)
Anyone else have a better idea?
And, if possible, I'd ideally like for the newnum variable to be a multiple of the baseline
(but that was too complicated, so I just created a random digit).
Thanks for any help!
CodePudding user response:
The groupby returns one row per user (4 rows), which does not match your dataframe's 9 rows 1:1, hence the error.
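To make the length mismatch concrete, here is a minimal sketch on a small stand-in frame (the `demo` data and variable names are made up for illustration, not taken from the question). It compares a plain aggregation with `transform`, which returns a result aligned to the original rows.

```python
import pandas as pd

# Hypothetical stand-in data: three users with 3, 1 and 2 rows
demo = pd.DataFrame({
    "user": ["a", "a", "a", "b", "c", "c"],
    "seq": [1, 2, 3, 1, 1, 2],
})

# Aggregation: one value per group, indexed by user
per_group = demo.groupby("user")["seq"].max()

# Transform: each group's max repeated on every row of that group
per_row = demo.groupby("user")["seq"].transform("max")

print(len(per_group), len(per_row), len(demo))
```

Only `per_row` has the same length and index as `demo`, so only it can be compared row by row with `demo["seq"]` or assigned as a column.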
Here is how you can accomplish it:
# use transform with the groupby to return each user's max seq
# against every row of that user, then compare it with seq
df['newnum'] = np.where(df.groupby('user')['seq'].transform('max').eq(df['seq']),
                        np.random.randint(4, 9),
                        np.nan)
df
      user  baseline  score  binary  var1  seq  newnum
0  user122       4.0    NaN       1     3    1     NaN
1  user122       4.0    3.0       1     5    2     NaN
2  user122       4.0    2.0       0     5    3     6.0
3  user124       2.0    5.0       0     1    1     6.0
4  user125       4.0    NaN       0     1    1     NaN
5  user125       4.0    6.0       0     1    2     6.0
6  user126       5.0    3.0       1     1    1     NaN
7  user126       5.0    2.0       0     3    2     NaN
8  user126       5.0    1.0       1     5    3     6.0
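Note that np.random.randint(4, 9) is evaluated once, so every max-seq row receives the same number (6.0 in the output above). If a separate draw per row is wanted, or a multiple of baseline as the question mentions, one possible sketch follows. The seed, the multiplier range 2 to 4, and the column name `newnum_mult` are assumptions chosen for illustration; the frame is a trimmed copy of the question's data.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's data (only the columns needed here)
df = pd.DataFrame({
    "user": ["user122"] * 3 + ["user124"] + ["user125"] * 2 + ["user126"] * 3,
    "baseline": [4.0, 4.0, 4.0, 2.0, 4.0, 4.0, 5.0, 5.0, 5.0],
    "seq": [1, 2, 3, 1, 1, 2, 1, 2, 3],
})

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# True on the rows where seq equals the user's max seq
is_last = df.groupby("user")["seq"].transform("max").eq(df["seq"])

# One independent draw per row (4..8, upper bound exclusive),
# kept only on the max-seq rows
draws = rng.integers(4, 9, size=len(df))
df["newnum"] = np.where(is_last, draws, np.nan)

# Variant: a random integer multiple of baseline (multiplier 2..4 is an assumption)
multiplier = rng.integers(2, 5, size=len(df))
df["newnum_mult"] = (df["baseline"] * multiplier).where(is_last)

print(df)
```

Series.where keeps the value where the condition is True and puts NaN elsewhere, which plays the same role as np.where(..., np.nan) here.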
CodePudding user response:
idxmax = df.groupby('user')['seq'].idxmax()
df.loc[idxmax, 'newnum'] = ...
Notes:
- In the first line of the above code, we get the indexes of df where the maximum seq is reached for each user.
- In the second line, we create a new column newnum and, at the same time, assign some values to it at the idxmax positions. Other values are NaN by default.
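As a minimal sketch of the two lines above with a concrete right-hand side: the random draw from 4 to 8 and the seed are assumptions standing in for the elided value, and the frame is a trimmed copy of the question's data.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the question's data (only the columns needed here)
df = pd.DataFrame({
    "user": ["user122"] * 3 + ["user124"] + ["user125"] * 2 + ["user126"] * 3,
    "seq": [1, 2, 3, 1, 1, 2, 1, 2, 3],
})

# Index label of the row holding each user's max seq
idxmax = df.groupby("user")["seq"].idxmax()

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Assign one draw per user at those labels; all other rows become NaN
df.loc[idxmax, "newnum"] = rng.integers(4, 9, size=len(idxmax))

print(df)
```

One difference from the transform approach: idxmax returns only the first row when a user has several rows tied at the max seq, whereas comparing against transform('max') marks all tied rows.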