I'm having trouble finding the best approach to solve this:
I have a df:
   customer id                                   emails country
0            1  [email protected],[email protected]      US
1            2                      [email protected]      CA
2            3   [email protected],[email protected]      UK
3            4                     [email protected]      US
4            5                                      NaN     NaN
5            6                     [email protected]      UK
I want to detect duplicate customer ids based on the email ids in the emails column. For example, rows 2 and 5 are considered duplicates because they share the same email id, [email protected].
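For anyone who wants to reproduce this, here is a minimal construction of the frame above (assuming emails is a plain comma-separated string column):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer id": [1, 2, 3, 4, 5, 6],
    "emails": [
        "[email protected],[email protected]",
        "[email protected]",
        "[email protected],[email protected]",
        "[email protected]",
        np.nan,
        "[email protected]",
    ],
    "country": ["US", "CA", "UK", "US", np.nan, "UK"],
})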
I tried multiple approaches, like this one:
df.join(df['emails'].apply(lambda x: pd.Series(str(x).split(','))), lsuffix="_left").drop_duplicates(subset=[0,1],keep=False).astype(str)
but the email values appear in columns 0 and 1 in varying order, so drop_duplicates did not work.
I also used explode with drop_duplicates. It ran, but did not yield the desired output, since drop_duplicates(keep=False) removes every row whose email is duplicated instead of flagging the customers that share it:
df.join(df['emails'].apply(lambda x: str(x).split(',')), lsuffix="_left").explode(column='emails').drop_duplicates(subset='emails' , keep=False).groupby('customer id').agg({'emails':','.join,'country':'first'}).reset_index()
So this is the desired output:
   customer id                                   emails country
0            1  [email protected],[email protected]      US
1            2                      [email protected]      CA
2            4                     [email protected]      US
3            5                                      NaN     NaN
and another df for the duplicated values found, so I know which customer ids are flagged as duplicates:
   customer id                                  emails country  duplicate_id   duplicate_email
0            3  [email protected],[email protected]      UK             6  [email protected]
Thanks for the help.
CodePudding user response:
# Split the emails to one per row, keeping the customer id
tmp = df[["customer id"]].assign(emails=df["emails"].str.split(",")).explode("emails")

tmp = (
    # Find the duplicate emails
    tmp.merge(tmp, on="emails", suffixes=("_a", "_b"))
    # Prevent a customer id from matching with itself. Also pick just one of each
    # duplicate pairing: as 3 is a duplicate of 6, 6 is also a duplicate of 3.
    # The "<" condition picks only the first pairing.
    .query("`customer id_a` < `customer id_b`")
    # Remove the emails column as we no longer need it
    .drop(columns="emails")
)
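For the sample data, this leaves a single row pairing the two customers that share [email protected]:
print(tmp.to_string(index=False))
#  customer id_a  customer id_b
#              3              6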
# Now that we have lined up the duplicate customer ids, it's only a matter of
# joining with the original data frame to get what you want
duplicated = (
    df.merge(tmp, left_on="customer id", right_on="customer id_a")
    .merge(df, left_on="customer id_b", right_on="customer id")
    [["customer id_x", "emails_x", "country_x", "customer id_y", "emails_y"]]
    .set_axis(
        ["customer id", "emails", "country", "duplicate_id", "duplicate_email"],
        axis=1,
    )
)
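With the sample data, duplicated should match the second frame asked for in the question:
print(duplicated)
#    customer id                                  emails country  duplicate_id   duplicate_email
# 0            3  [email protected],[email protected]      UK             6  [email protected]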
# A customer is duplicated if they appear in the `duplicated` data frame
is_duplicated = (
    df["customer id"].isin(duplicated["customer id"])
    | df["customer id"].isin(duplicated["duplicate_id"])
)
# Select the non-duplicated
output = df[~is_duplicated]
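And output keeps the non-duplicated customers; note that the original index is preserved, so add .reset_index(drop=True) if you want it renumbered as in the question's desired output:
print(output)
#    customer id                                   emails country
# 0            1  [email protected],[email protected]      US
# 1            2                      [email protected]      CA
# 3            4                     [email protected]      US
# 4            5                                      NaN     NaN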
CodePudding user response:
Here's a possible way to do this: make a new DataFrame with exploded emails, then join it to itself to find where the same email occurs under different customer ids; this data can then be used to filter the original DataFrame:
df2 = df[['customer id', 'emails']].copy()
df2['emails'] = df2['emails'].str.split(',')
df2 = df2.explode('emails')
# now merge on email
dfm = df2.merge(df2, on='emails')
# find any pairs with a different customer id on each side
dupes = dfm[dfm['customer id_x'] != dfm['customer id_y']]['customer id_x']
# filter the input
df[df['customer id'].isin(dupes)]
df[~df['customer id'].isin(dupes)]
Output:
   customer id                                   emails country
2            3   [email protected],[email protected]      UK
5            6                     [email protected]      UK
   customer id                                   emails country
0            1  [email protected],[email protected]      US
1            2                      [email protected]      CA
3            4                     [email protected]      US
4            5                                      NaN     NaN
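If you also need the second frame from the question (who duplicates whom), dfm already contains the pairs; a minimal sketch, keeping each pair once via the "<" trick from the first answer (join back to df if you also want the emails and country columns):
pairs = (
    dfm[dfm['customer id_x'] < dfm['customer id_y']]
    [['customer id_x', 'customer id_y']]
    .rename(columns={'customer id_x': 'customer id',
                     'customer id_y': 'duplicate_id'})
)
print(pairs.to_string(index=False))
#  customer id  duplicate_id
#            3             6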