Looping through df and using a conditional to remove unneeded rows-CodePudding

Apologies if this may be a duplicated question, I have not had any luck solving my issue with guidance of semi-similar posts.

I have a df with columns ID and current_stage, using Python

I want to go through and find the duplicated values in ID, and of those that are duplicated check to see if they have a 1 or 2 for the current stage. if they only have 1 or 2, then I only need one record of that ID. If there is a 3 or 4 in an instance of a duplicated ID, I want to keep all of the records of that duplicated ID.

I appreciate any help the gods of stack overflow have!

Thank you

CodePudding user response：

I may have a way to go around..

You divide the data into two dataframes, drop the duplicates from one and merge them again as follow:

df1 = df[df['current_stage'].isin([1,2])]
df2 = df[~df['current_stage'].isin([1,2])]
df1.drop_duplicates(subset=['ID'], inplace = True)
df = pd.concat([df1, df2])

CodePudding user response：

Given the dataframe below:

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3], 'current_stage': [1, 1, 2, 3, 3, 4, 4, 1, 2, 2, 4]})

    ID  current_stage
0    1              1
1    1              1
2    1              2
3    1              3
4    1              3
5    2              4
6    2              4
7    2              1
8    2              2
9    2              2
10   3              4

You can do:

out = df[df.groupby('ID')['current_stage'].transform(np.size)>1].groupby('current_stage').apply(lambda x: x.iloc[0].to_frame().T if x.iloc[0]['current_stage'] in [1,2] else x).reset_index(drop=True)

Output:

   ID  current_stage
0   1              1
1   1              2
2   1              3
3   1              3
4   2              4
5   2              4