I have a data frame as such:
data = [['123', 'Yes', 'No', 'No'], ['123', 'No', 'Yes', 'No'],['1234', 'No', 'Yes', 'No']]
df = pd.DataFrame(data, columns=['ID', 'Object_1', 'Object_2', 'Object_3'])
ID | Object_1 | Object_2 | Object_3 |
---|---|---|---|
123 | Yes | No | No |
123 | No | Yes | No |
1234 | No | Yes | No |
I want to group by the ID column though the values for Object_1, Object_2 and Object_3 may be different. If the value 'Yes' exists, I would like that remain in the final grouped dataframe.
Desired output would be a dataframe with the following values:
ID | Object_1 | Object_2 | Object_3 |
---|---|---|---|
123 | Yes | Yes | No |
1234 | No | Yes | No |
CodePudding user response:
You can take advantage of the fact that Yes
is lexicographically sorted after No
:
df.groupby('ID', as_index=False).max()
Output:
ID Object_1 Object_2 Object_3
0 123 Yes Yes No
1 1234 No Yes No
more robust/generic way
You can use an ordered Categorical type to handle any values, even more than two (e.g, No/Maybe/Yes):
# convert to Categorical
df.update(df.filter(like='Object')
.apply(pd.Categorical,
categories=['No', 'Yes'],
ordered=True))
# get max per group
df.groupby('ID', as_index=False).max()