I got a dataframe similiar to this:
Original:
# | col_a | col_b | ..
1 | 100725 | No | ..
2 | 100726 | Yes | ..
3 | 100726 | No | ..
4 | 100726 | No | ..
5 | 100727 | Yes | ..
6 | 100728 | No | ..
7 | 100728 | Yes | ..
8 | 100728 | Yes | ..
9 | 100729 | No | ..
10 | 100729 | No | ..
Result:
# | col_a | col_b | ..
1 | 100725 | No | ..
2 | 100726 | Yes | ..
5 | 100727 | Yes | ..
8 | 100728 | Yes | ..
9 | 100729 | No | ..
What I'm trying to achieve...
As shown above, I want to create a new dataframe with only unique numbers inside col_a, but with a twist: In col_b I want to receive a Yes, regardless if all or just one row contains this value. Only if there are only No-Values I want to receive a No in the final dataframe. Is there a clever way to achieve such a dataframe without creating a bunch of temporary columns for calculating.
Thanks for your help and have a great day!
finethen
CodePudding user response:
Sort col_b
values ('Yes' first then 'No') then keep first occurrence of col_b
and finally restore the index order:
>>> df.sort_values('col_b', ascending=False).drop_duplicates('col_a').sort_index()
col_a col_b
0 100725 No
1 100726 Yes
4 100727 Yes
6 100728 Yes
8 100729 No
Step by step:
>>> df.sort_values('col_b', ascending=False)
col_a col_b
1 100726 Yes
4 100727 Yes
6 100728 Yes
7 100728 Yes
0 100725 No
2 100726 No
3 100726 No
5 100728 No
8 100729 No
9 100729 No
>>> df.sort_values('col_b', ascending=False).drop_duplicates('col_a')
col_a col_b
1 100726 Yes
4 100727 Yes
6 100728 Yes
0 100725 No
8 100729 No
>>> df.sort_values('col_b', ascending=False).drop_duplicates('col_a').sort_index()
col_a col_b
0 100725 No
1 100726 Yes
4 100727 Yes
6 100728 Yes
8 100729 No
Setup:
data = {'col_a': [100725, 100726, 100726, 100726, 100727,
100728, 100728, 100728, 100729, 100729],
'col_b': ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No']}
df = pd.DataFrame(data)
print(df)
# Output:
col_a col_b
0 100725 No
1 100726 Yes
2 100726 No
3 100726 No
4 100727 Yes
5 100728 No
6 100728 Yes
7 100728 Yes
8 100729 No
9 100729 No
CodePudding user response:
groupby, where there exists any Yes broadcat it and then drop duplicated.
df=df.assign(col_b=np.where(df.groupby('col_a')['col_b'].transform(lambda x:( x=='Yes').any()),'Yes', df['col_b'])).drop_duplicates(subset=['col_a','col_b'], keep='first')
print(df)
# col_a col_b
0 1 100725 No
1 2 100726 Yes
4 5 100727 Yes
5 6 100728 Yes
8 9 100729 No