I have a dataframe with duplicated email addresses.
ID | EmailAddress | Name | Country | Distance | IDLen | NonNAN |
---|---|---|---|---|---|---|
39203920 | john@example.com | John | UK | 12 | 8 | 6 |
32323 | john@example.com | NaN | UK | 12 | 5 | 5 |
I have created two additional columns: IDLen (the length of the ID) and NonNAN (how many non-NaN fields each row has). I would like to create two new dataframes:

- df1: for each duplicated email, the row with the higher NonNAN value; or, if those are equal, the row with the lowest IDLen.
- df2: the remaining rows.
I was thinking of using the df.duplicated() function, but it only lets you keep the first or last occurrence, and I need something more sophisticated.
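For reference, a minimal sketch of how the two helper columns could be built (assuming ID is numeric, and NonNAN is computed after IDLen so it counts that column too):

df['IDLen'] = df['ID'].astype(str).str.len()  # number of digits in the ID
df['NonNAN'] = df.notna().sum(axis=1)         # non-NaN cells in each row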
Thanks in advance.
CodePudding user response:
You can absolutely use duplicated; you just have to sort your data according to your conditions.
Given df:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 john@example.com John UK 12 8 6
1 32323 john@example.com NaN UK 12 5 5
2 2423 bob@example.com Bob AUS 32 4 4
3 24233 bob@example.com Robert AUS 32 5 5
Doing:
# best candidate first: most non-NaN fields, then shortest ID length
df = df.sort_values(['NonNAN', 'IDLen'], ascending=[False, True])
mask = df.duplicated('EmailAddress')  # True for every occurrence after the first
df1 = df[~mask]
df2 = df[mask]
print(df1)
print(df2)
# Output df1:
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 john@example.com John UK 12 8 6
3 24233 bob@example.com Robert AUS 32 5 5
# Output df2:
ID EmailAddress Name Country Distance IDLen NonNAN
1 32323 john@example.com NaN UK 12 5 5
2 2423 bob@example.com Bob AUS 32 4 4
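If you want each result back in its original row order, an optional sort_index() restores it:

df1 = df1.sort_index()
df2 = df2.sort_index()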
CodePudding user response:
Will this work for you? The key is to sort the data first and then apply df.duplicated(), which is far more efficient than looping through each record with .apply(lambda ...).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [39203920, 32323, 22222, 392999],
    'EmailAddress': ['john@example.com', 'john@example.com', 'jane@example.com', 'john@example.com'],
    'Name': ['John', np.nan, 'Jane', 'John'],
    'Country': ['UK', 'UK', 'UK', 'UK'],
    'Distance': [12, 12, 12, 12],
    'IDLen': [8, 5, 5, 6],
    'NonNAN': [6, 5, 6, 6]})
df = df.sort_values(['EmailAddress', 'NonNAN', 'IDLen'], ascending=[True, False, True])
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 jane@example.com Jane UK 12 5 6
3 392999 john@example.com John UK 12 6 6
0 39203920 john@example.com John UK 12 8 6
1 32323 john@example.com NaN UK 12 5 5
Based on your rules, I have sorted the data so that the desired record comes first in each group. When df.duplicated() is applied on EmailAddress, that first record is kept:
df1 = df[~df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
2 22222 jane@example.com Jane UK 12 5 6
3 392999 john@example.com John UK 12 6 6
df2 = df[df.duplicated('EmailAddress')]
ID EmailAddress Name Country Distance IDLen NonNAN
0 39203920 john@example.com John UK 12 8 6
1 32323 john@example.com NaN UK 12 5 5
If your ID column is purely numerical (i.e., not alphanumeric), you can sort on ascending ID instead and drop the IDLen column entirely, because the smallest numeric ID is also the shortest one whenever NonNAN is tied.
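A sketch of that variant (same sample df, assuming ID holds plain positive integers, so a smaller number also means a shorter string):

# sort by email, then most non-NaN fields, then smallest (and therefore shortest) ID
df = df.sort_values(['EmailAddress', 'NonNAN', 'ID'], ascending=[True, False, True])
df1 = df[~df.duplicated('EmailAddress')]
df2 = df[df.duplicated('EmailAddress')]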
CodePudding user response:
You can create a boolean mask of the rows you want to select for df1 - then the inverse of that mask automatically selects the rows for df2.
For each group where EmailAddress is the same:
- mask selects all the rows where NonNAN has the maximum value.
- if mask selects multiple rows, it is narrowed to the rows where IDLen is the minimum.
- if mask still selects multiple rows, just the first one is taken.
def f(df):
    # rows where NonNAN is at its maximum for this group
    mask = df['NonNAN'] == df['NonNAN'].max()
    if mask.sum() > 1:
        # tie-break: among those rows, keep only the minimum IDLen
        mask = mask & (df['IDLen'] == df.loc[mask, 'IDLen'].min())
    if mask.sum() > 1:
        # still tied: keep only the first selected row
        mask.iloc[mask.argmax() + 1:] = False
    return mask
mask = df.groupby('EmailAddress', group_keys=False).apply(f)
mask = mask.reindex(df.index)  # align the mask back to the original row order
df1 = df[mask]
df2 = df[~mask]
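A quick check, reusing the sample df from the previous answer (an assumption here, for illustration): in the john group the maximum NonNAN is 6, shared by two rows, and the smaller IDLen (6) picks the ID 392999 row.

print(df1.index.tolist())  # [2, 3] -> Jane's row plus the 392999 John row
print(df2.index.tolist())  # [0, 1] -> the remaining duplicates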