I have a df where there are duplicates that exist for one column. I don't want to simply just keep the first one. I want to take whichever one of those duplicates has a value in the 'verified' column. If there is multiple rows with a value in that column, just take the first row that has a verified value. If none of the duplicates have verified rows, then just keep the first duplicate. It is important that all of the rows have a name value. If they don't have a name value, just replace the NaN with the ID value. I also want to have a list of all of the IDs that don't have a name.
original
df_IDs=
name ID verified
joe 123 213
sally 123
sarah 342
mary 342 643
adam 342 932
sam 214
jill 214
jack 992
NaN 379
NaN 579
desired
df_IDs=
name ID verified
joe 123 213
mary 342 643
sam 214
jack 992
ID/379 379
ID/579 579
no_match=[ID/379, ID/579]
CodePudding user response:
Find duplicate rows in a Dataframe based on all or selected columns.
# Selecting duplicate rows except first
# occurrence based on all columns
duplicate = df[df.duplicated(keep = 'first')]
CodePudding user response:
Here's my best effort attempt at answering your question. The idea is to combine separate dataframes to derive the desired answer.
# Generate Data
data_dict = {
'name': ['joe', 'sally', 'sarah', 'mary', 'adam', 'sam', 'jill', 'jack', np.nan, np.nan],
'ID': [123, 123, 342, 342, 342, 214, 214, 992, 379, 579],
'verified': [213, np.nan, np.nan, 643, 932, np.nan, np.nan, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(data_dict)
# First Part of the Answer
for idx, row in df.iterrows():
if isinstance(row['name'], float):
row = row.copy()
df.at[idx, 'name'] = row['ID']
# Second Part of the Answer
a = df.dropna().drop_duplicates(subset=['ID'])
id_a = a['ID'].unique()
b = df[df['verified'].isna()].drop_duplicates(subset=['ID'])
c = b[b['ID'].apply(lambda x: x not in id_a)]
pd.concat([a,c])
First part of the answer addresses "If they don't have a name value, just replace the NaN with the ID value." and the second part of the answer addresses "I want to take whichever one of those duplicates has a value in the 'verified' column. If there is multiple rows with a value in that column, just take the first row that has a verified value. If none of the duplicates have verified rows, then just keep the first duplicate. It is important that all of the rows have a name value."
CodePudding user response:
Identify the missing names and print the IDs:
print(f"Missing names: {', '.join(df.loc[df['name'].isna(), 'ID'].astype(str))}")
# Missing names: 379, 579
fillna
the names, then get the first non-NA (or first NA if none) per group:
df2 = df.assign(name=df['name'].fillna('ID/' df['ID'].astype(str)))
df2 = df2.loc[df2['verified'].notna().groupby(df['ID'], sort=False).idxmax()]
print(df2)
Output:
name ID verified
0 joe 123 213.0
3 mary 342 643.0
5 sam 214 NaN
7 jack 992 NaN
8 ID/379 379 NaN
9 ID/579 579 NaN