Apologies if this is a duplicate question; I have not had any luck solving my issue with the guidance of semi-similar posts.
I have a pandas DataFrame with columns ID and current_stage (using Python).
I want to find the duplicated values in ID and, for each duplicated ID, check its current_stage values. If an ID only has stages 1 or 2, I only need one record for that ID. If any record of a duplicated ID has a stage of 3 or 4, I want to keep all of the records for that ID. A small made-up example of what I mean is below.
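To make the rule concrete, here is a small made-up example of the kind of data I have (the real values are different):

import pandas as pd

df = pd.DataFrame({'ID': [10, 10, 10, 20, 20],
                   'current_stage': [1, 2, 2, 1, 3]})

ID 10 only ever has stages 1 and 2, so I only need one of its rows. ID 20 has a 3, so I want to keep both of its rows.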
I appreciate any help the gods of Stack Overflow have to offer!
Thank you
CodePudding user response:
I may have a workaround. Split the data into two dataframes, drop the duplicates from the one that only contains stages 1 and 2, then concatenate them again, as follows:
# Rows whose stage is 1 or 2, and everything else
df1 = df[df['current_stage'].isin([1, 2])].copy()   # .copy() avoids SettingWithCopyWarning
df2 = df[~df['current_stage'].isin([1, 2])]
# Keep a single row per ID among the stage-1/2 rows
df1.drop_duplicates(subset=['ID'], inplace=True)
# Stitch the two parts back together
df = pd.concat([df1, df2])
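One thing to keep in mind: pd.concat just stacks the deduplicated stage-1/2 rows on top of the stage-3/4 rows, so the original row order is lost. Because the original index is preserved, you can restore the order afterwards, for example:

df = pd.concat([df1, df2]).sort_index()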
CodePudding user response:
Given the dataframe below:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3], 'current_stage': [1, 1, 2, 3, 3, 4, 4, 1, 2, 2, 4]})
    ID  current_stage
0    1              1
1    1              1
2    1              2
3    1              3
4    1              3
5    2              4
6    2              4
7    2              1
8    2              2
9    2              2
10   3              4
You can first keep only the IDs that appear more than once, then group by current_stage and keep just the first row for stages 1 and 2 and every row for stages 3 and 4:

# Keep only the IDs that appear more than once
dups = df[df.groupby('ID')['current_stage'].transform('size') > 1]

# For stages 1 and 2 keep just the first row, otherwise keep every row
out = (dups.groupby('current_stage')
           .apply(lambda g: g.head(1) if g.name in [1, 2] else g)
           .reset_index(drop=True))
Output:
   ID  current_stage
0   1              1
1   1              2
2   1              3
3   1              3
4   2              4
5   2              4