Remove duplicates based on multiple criteria-CodePudding

I have a df where there are duplicates that exist for one column. I don't want to simply just keep the first one. I want to take whichever one of those duplicates has a value in the 'verified' column. If there is multiple rows with a value in that column, just take the first row that has a verified value. If none of the duplicates have verified rows, then just keep the first duplicate. It is important that all of the rows have a name value. If they don't have a name value, just replace the NaN with the ID value. I also want to have a list of all of the IDs that don't have a name.

original

df_IDs=
name           ID       verified
joe            123      213
sally          123     
sarah          342     
mary           342      643
adam           342      932
sam            214         
jill           214
jack           992
NaN            379  
NaN            579

desired

df_IDs=
name           ID       verified
joe            123      213
mary           342      643
sam            214         
jack           992
ID/379         379   
ID/579         579         

no_match=[ID/379, ID/579]

CodePudding user response：

Find duplicate rows in a Dataframe based on all or selected columns.

# Selecting duplicate rows except first
# occurrence based on all columns
duplicate = df[df.duplicated(keep = 'first')]

CodePudding user response：

Here's my best effort attempt at answering your question. The idea is to combine separate dataframes to derive the desired answer.

# Generate Data
data_dict = {
    'name': ['joe', 'sally', 'sarah', 'mary', 'adam', 'sam', 'jill', 'jack', np.nan, np.nan],
    'ID': [123, 123, 342, 342, 342, 214, 214, 992, 379, 579],
    'verified': [213, np.nan, np.nan, 643, 932, np.nan, np.nan, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(data_dict)

# First Part of the Answer
for idx, row in df.iterrows():
    if isinstance(row['name'], float):
        row = row.copy()
        df.at[idx, 'name'] = row['ID']
        
# Second Part of the Answer
a = df.dropna().drop_duplicates(subset=['ID'])
id_a = a['ID'].unique()
b = df[df['verified'].isna()].drop_duplicates(subset=['ID'])
c = b[b['ID'].apply(lambda x: x not in id_a)]

pd.concat([a,c])

First part of the answer addresses "If they don't have a name value, just replace the NaN with the ID value." and the second part of the answer addresses "I want to take whichever one of those duplicates has a value in the 'verified' column. If there is multiple rows with a value in that column, just take the first row that has a verified value. If none of the duplicates have verified rows, then just keep the first duplicate. It is important that all of the rows have a name value."

CodePudding user response：

Identify the missing names and print the IDs:

print(f"Missing names: {', '.join(df.loc[df['name'].isna(), 'ID'].astype(str))}")
# Missing names: 379, 579

fillna the names, then get the first non-NA (or first NA if none) per group:

df2 = df.assign(name=df['name'].fillna('ID/' df['ID'].astype(str)))

df2 = df2.loc[df2['verified'].notna().groupby(df['ID'], sort=False).idxmax()]

print(df2)

Output:

     name   ID  verified
0     joe  123     213.0
3    mary  342     643.0
5     sam  214       NaN
7    jack  992       NaN
8  ID/379  379       NaN
9  ID/579  579       NaN