Python, Pandas: Drop datframe columns and combine rows with condition-CodePudding

I got a dataframe similiar to this:

Original:

#   |  col_a   |  col_b  |  ..
1   |  100725  |  No     |  ..
2   |  100726  |  Yes    |  ..
3   |  100726  |  No     |  ..
4   |  100726  |  No     |  ..
5   |  100727  |  Yes    |  ..
6   |  100728  |  No     |  ..
7   |  100728  |  Yes    |  ..
8   |  100728  |  Yes    |  ..
9   |  100729  |  No     |  ..
10  |  100729  |  No     |  ..

Result:

#   |  col_a   |  col_b  |  ..
1   |  100725  |  No     |  ..
2   |  100726  |  Yes    |  ..
5   |  100727  |  Yes    |  ..
8   |  100728  |  Yes    |  ..
9   |  100729  |  No     |  ..

What I'm trying to achieve...

As shown above, I want to create a new dataframe with only unique numbers inside col_a, but with a twist: In col_b I want to receive a Yes, regardless if all or just one row contains this value. Only if there are only No-Values I want to receive a No in the final dataframe. Is there a clever way to achieve such a dataframe without creating a bunch of temporary columns for calculating.

Thanks for your help and have a great day!

finethen

CodePudding user response：

Sort col_b values ('Yes' first then 'No') then keep first occurrence of col_b and finally restore the index order:

>>> df.sort_values('col_b', ascending=False).drop_duplicates('col_a').sort_index()

    col_a col_b
0  100725    No
1  100726   Yes
4  100727   Yes
6  100728   Yes
8  100729    No

Step by step:

>>> df.sort_values('col_b', ascending=False)
    col_a col_b
1  100726   Yes
4  100727   Yes
6  100728   Yes
7  100728   Yes
0  100725    No
2  100726    No
3  100726    No
5  100728    No
8  100729    No
9  100729    No

>>> df.sort_values('col_b', ascending=False).drop_duplicates('col_a')
    col_a col_b
1  100726   Yes
4  100727   Yes
6  100728   Yes
0  100725    No
8  100729    No

>>> df.sort_values('col_b', ascending=False).drop_duplicates('col_a').sort_index()
    col_a col_b
0  100725    No
1  100726   Yes
4  100727   Yes
6  100728   Yes
8  100729    No

Setup:

data = {'col_a': [100725, 100726, 100726, 100726, 100727,
                  100728, 100728, 100728, 100729, 100729],
        'col_b': ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No']}
df = pd.DataFrame(data)
print(df)

# Output:
    col_a col_b
0  100725    No
1  100726   Yes
2  100726    No
3  100726    No
4  100727   Yes
5  100728    No
6  100728   Yes
7  100728   Yes
8  100729    No
9  100729    No

CodePudding user response：

groupby, where there exists any Yes broadcat it and then drop duplicated.

df=df.assign(col_b=np.where(df.groupby('col_a')['col_b'].transform(lambda x:( x=='Yes').any()),'Yes', df['col_b'])).drop_duplicates(subset=['col_a','col_b'], keep='first')

print(df)



#   col_a col_b
0  1  100725    No
1  2  100726   Yes
4  5  100727   Yes
5  6  100728   Yes
8  9  100729    No