Drop and extract duplicates in comma-separated & swapped values with pandas


I'm having trouble finding the best approach to this:

I have a df:

   customer id                         emails country
0            1  a1@example.com,a2@example.com      US
1            2                  b@example.com      CA
2            3    c@example.com,d@example.com      UK
3            4                  e@example.com      US
4            5                            NaN     NaN
5            6                  c@example.com      UK
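
To reproduce this, here is a minimal construction of the frame above (the addresses are placeholders standing in for the real ones):

import pandas as pd
import numpy as np

# placeholder data matching the layout above; the email strings are illustrative only
df = pd.DataFrame({
    "customer id": [1, 2, 3, 4, 5, 6],
    "emails": [
        "a1@example.com,a2@example.com",
        "b@example.com",
        "c@example.com,d@example.com",
        "e@example.com",
        np.nan,
        "c@example.com",
    ],
    "country": ["US", "CA", "UK", "US", np.nan, "UK"],
})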

I want to detect duplicate customer ids based on the email ids in the emails column. For example, rows 2 and 5 are considered duplicates because they share the same email id c@example.com.

I tried multiple approaches like this one:

df.join(df['emails'].apply(lambda x: pd.Series(str(x).split(','))), lsuffix="_left").drop_duplicates(subset=[0,1],keep=False).astype(str)

but the same pair of emails can land in columns 0 and 1 in either order, so drop_duplicates did not treat those rows as duplicates.

I also tried explode with drop_duplicates; it ran, but did not yield the desired output:

df.join(df['emails'].apply(lambda x: str(x).split(',')), lsuffix="_left").explode(column='emails').drop_duplicates(subset='emails' , keep=False).groupby('customer id').agg({'emails':','.join,'country':'first'}).reset_index()

So this is the desired output:

   customer id                         emails country
0            1  a1@example.com,a2@example.com      US
1            2                  b@example.com      CA
2            4                  e@example.com      US
3            5                            NaN     NaN

and another df for the duplicates that were found, so I know which customer ids are flagged as duplicates:

   customer id                       emails country  duplicate_id duplicate_email
0            3  c@example.com,d@example.com      UK             6   c@example.com

Thanks for the help.

CodePudding user response:

# Split the emails to one per row, keeping the customer id
tmp = df[["customer id"]].assign(emails=df["emails"].str.split(",")).explode("emails")
tmp = (
    # Find the duplicate emails
    tmp.merge(tmp, on="emails", suffixes=("_a", "_b"))
    # Prevent a customer id from matching with itself. Also pick just one of the
    # duplicate pairing: as 3 is a duplicate of 6, 6 is also a duplicate of 3.
    # The "<" condition picks only the first pairing.
    .query("`customer id_a` < `customer id_b`")
    # Remove the emails column as we no longer need it
    .drop(columns="emails")
)

# Now that we have lined up the duplicate customer ids, it's only a matter of
# joining with the original data frame to get what you want
duplicated = (
    df.merge(tmp, left_on="customer id", right_on="customer id_a")
    .merge(df, left_on="customer id_b", right_on="customer id")
    [["customer id_x", "emails_x", "country_x", "customer id_y", "emails_y"]]
    .set_axis(
        ["customer id", "emails", "country", "duplicate_id", "duplicate_email"], axis=1
    )
)

# A customer is duplicated if they appear in the `duplicated` data frame
is_duplicated = (
    df["customer id"].isin(duplicated["customer id"])
    | df["customer id"].isin(duplicated["duplicate_id"])
)

# Select the non-duplicated
output = df[~is_duplicated]
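
With the placeholder frame from the question, the two resulting frames come out roughly like this (keep the index as-is, or call .reset_index(drop=True) if you want it renumbered as in the question):

print(duplicated)
#    customer id                       emails country  duplicate_id duplicate_email
# 0            3  c@example.com,d@example.com      UK             6   c@example.com

print(output)
#    customer id                         emails country
# 0            1  a1@example.com,a2@example.com      US
# 1            2                  b@example.com      CA
# 3            4                  e@example.com      US
# 4            5                            NaN     NaN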

CodePudding user response:

Here's a possible way to do this: make a new DataFrame with the emails exploded to one per row, then join it to itself to find where the same email occurs with different customer ids; that result can then be used to filter the original DataFrame:

df2 = df[['customer id', 'emails']].copy()
df2['emails'] = df2['emails'].str.split(',')
df2 = df2.explode('emails')
# now merge on email
dfm = df2.merge(df2, on='emails')
# find any duplicates with a different customer id
dupes = dfm[dfm['customer id_x'] != dfm['customer id_y']]['customer id_x']
# filter the input: first the duplicated rows, then the frame with the duplicates removed
df[df['customer id'].isin(dupes)]
df[~df['customer id'].isin(dupes)]

Output:

   customer id                       emails country
2            3  c@example.com,d@example.com      UK
5            6                c@example.com      UK

   customer id                         emails country
0            1  a1@example.com,a2@example.com      US
1            2                  b@example.com      CA
3            4                  e@example.com      US
4            5                            NaN     NaN
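
If you also want the second frame from the question (which customer id pairs with which), the same dfm can be reshaped for that. A minimal sketch, assuming the placeholder frame and the column names used above:

pairs = dfm[dfm['customer id_x'] < dfm['customer id_y']]  # keep each duplicate pairing once
dup_detail = (
    pairs.rename(columns={'customer id_x': 'customer id',
                          'customer id_y': 'duplicate_id',
                          'emails': 'duplicate_email'})
         # bring back the full email list and country of the kept customer
         .merge(df[['customer id', 'emails', 'country']], on='customer id')
         [['customer id', 'emails', 'country', 'duplicate_id', 'duplicate_email']]
)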