I have a pandas dataframe in the below format
id name value_1 value_2
1 def 1 0
2 abc 0 1
I would need to sort the above dataframe based on id, name, value_1 & value_2. Following that, for every group of [id,name,value_1,value_2], get the first row and set df['result'] = 1. For the other rows in that group, set df['result'] = 0.
I do the sorting and get the first row using the below code:
df = df.sort_values(["id","name","value_1","value_2"], ascending=True)
first_row_per_group = df.groupby(["id","name","value_1","value_2"]).agg('first')
After getting the first row, I set first_row_per_group ['result'] = 1. But I am not sure how to set the other rows (non-first) rows to 0.
Any suggestions would be appreciated.
CodePudding user response:
duplicated
would be faster than groupby
:
df = df.sort_values(['id', 'name', 'value_1', 'value_2'])
df['result'] = (~df['id'].duplicated()).astype(int)
CodePudding user response:
use df.groupby(...).cumcount()
to get a counter of rows within the group which you can then manipulate.
In [51]: df
Out[51]:
a b c
0 def 1 0
1 abc 0 1
2 def 1 0
3 abc 0 1
In [52]: df2 = df.sort_values(['a','b','c'])
In [53]: df2['result'] = df2.groupby(['a', 'b', 'c']).cumcount()
In [54]: df2['result'] = np.where(df2['result'] == 0, 1, 0)
In [55]: df2
Out[55]:
a b c result
1 abc 0 1 1
3 abc 0 1 0
0 def 1 0 1
2 def 1 0 0