I have a dataframe with duplicated email addresses.
ID | EmailAddress | Name | Country | Distance | IDLen | NonNAN |
---|---|---|---|---|---|---|
39203920 | john@example.com | John | UK | 12 | 8 | 6 |
32323 | john@example.com | NaN | UK | 12 | 5 | 5 |
I have created two additional columns: IDLen (the length of the ID) and NonNAN (how many non-NaN fields each row has). I would like to create two new dataframes:

- df1: for each duplicated email, the row with the higher NonNAN value; or, if those are equal, the row with the lowest IDLen.
- df2: the remaining rows.
I was thinking of using the df.duplicated() function, but it only lets you keep the first or last occurrence, and I need something more sophisticated.
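For reference, a minimal sketch of how the two helper columns could be built (assuming ID is numeric, and NonNAN is computed after IDLen so it counts that column too):

df['IDLen'] = df['ID'].astype(str).str.len()  # number of digits in the ID
df['NonNAN'] = df.notna().sum(axis=1)         # non-NaN cells in each row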
Thanks in advance.
CodePudding user response:
You can absolutely use duplicated; you just have to sort your data according to your conditions.
Given df:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 john@example.com John UK 12 8 6
1 32323 john@example.com NaN UK 12 5 5
2 2423 bob@example.com Bob AUS 32 4 4
3 24233 bob@example.com Robert AUS 32 5 5
Doing:
# best candidate first: most non-NaN fields, then shortest ID length
df = df.sort_values(['NonNAN', 'IDLen'], ascending=[False, True])
mask = df.duplicated('EmailAddress')  # True for every occurrence after the first
df1 = df[~mask]
df2 = df[mask]
print(df1)
print(df2)
# Output df1:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 john@example.com John UK 12 8 6
3 24233 bob@example.com Robert AUS 32 5 5
# Output df2:
ID EmailAddress Name Country Distance IDLen NonNAN
1 32323 john@example.com NaN UK 12 5 5
2 2423 bob@example.com Bob AUS 32 4 4
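If you want each result back in its original row order, an optional sort_index() restores it:

df1 = df1.sort_index()
df2 = df2.sort_index()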
CodePudding user response:
Will this work for you? The key is to sort the data first and then apply df.duplicated(), which is far more efficient than looping through each record with .apply(lambda ...).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [39203920, 32323, 22222, 392999],
    'EmailAddress': ['john@example.com', 'john@example.com', 'jane@example.com', 'john@example.com'],
    'Name': ['John', np.nan, 'Jane', 'John'],
    'Country': ['UK', 'UK', 'UK', 'UK'],
    'Distance': [12, 12, 12, 12],
    'IDLen': [8, 5, 5, 6],
    'NonNAN': [6, 5, 6, 6]})
df = df.sort_values(['EmailAddress', 'NonNAN', 'IDLen'], ascending=[True, False, True])
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 jane@example.com Jane UK 12 5 6
3 392999 john@example.com John UK 12 6 6
0 39203920 john@example.com John UK 12 8 6
1 32323 john@example.com NaN UK 12 5 5
Based on your rules, I have sorted the data so that the desired record comes first in each group. When df.duplicated() is applied on EmailAddress, that first record is kept:
df1 = df[~df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 jane@example.com Jane UK 12 5 6
3 392999 john@example.com John UK 12 6 6
df2 = df[df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 john@example.com John UK 12 8 6
1 32323 john@example.com NaN UK 12 5 5
If your ID column is purely numerical (i.e., not alphanumeric), you can sort on ascending ID instead and drop the IDLen column entirely, because the smallest numeric ID is also the shortest one whenever NonNAN is tied.
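A sketch of that variant (same sample df, assuming ID holds plain positive integers, so a smaller number also means a shorter string):

# sort by email, then most non-NaN fields, then smallest (and therefore shortest) ID
df = df.sort_values(['EmailAddress', 'NonNAN', 'ID'], ascending=[True, False, True])
df1 = df[~df.duplicated('EmailAddress')]
df2 = df[df.duplicated('EmailAddress')]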
CodePudding user response:
You can create a boolean mask of the rows you want to select for df1 - then the inverse of that mask automatically selects the rows for df2.
For each group where EmailAddress is the same:
- mask selects all the rows where NonNAN has the maximum value.
- if mask selects multiple rows, it is narrowed to the rows where IDLen is the minimum.
- if mask still selects multiple rows, just the first one is taken.
def f(df):
    # rows where NonNAN is at its maximum for this group
    mask = df['NonNAN'] == df['NonNAN'].max()
    if mask.sum() > 1:
        # tie-break: among those rows, keep only the minimum IDLen
        mask = mask & (df['IDLen'] == df.loc[mask, 'IDLen'].min())
    if mask.sum() > 1:
        # still tied: keep only the first selected row
        mask.iloc[mask.argmax() + 1:] = False
    return mask
mask = df.groupby('EmailAddress', group_keys=False).apply(f)
mask = mask.reindex(df.index)  # align the mask back to the original row order
df1 = df[mask]
df2 = df[~mask]
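A quick check, reusing the sample df from the previous answer (an assumption here, for illustration): in the john group the maximum NonNAN is 6, shared by two rows, and the smaller IDLen (6) picks the ID 392999 row.

print(df1.index.tolist())  # [2, 3] -> Jane's row plus the 392999 John row
print(df2.index.tolist())  # [0, 1] -> the remaining duplicates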