Home > Net >  Drop rows in pandas if multiple specific conditions are met
Drop rows in pandas if multiple specific conditions are met

Time:12-09

I have a pandas data frame that looks like this:

Group  ImportedFrom    Item  SomeDate
 X     Switzerland       A    1/4/2021
 X       US              A    2/1/2021
 Y       US              B    2/2/2021
 Y       US              C    3/1/2021
 Y       US              C    
 X       China           D    

I need to only remove rows where someDate is null/empty, only if the unique combination of "ImpotedFrom" & "Group" appears at least once in the dataframe.

The final result should look something like this:

Group  ImportedFrom    Item  SomeDate
 X     Switzerland       A    1/4/2021
 X       US              A    2/1/2021
 Y       US              B    2/2/2021
 Y       US              C    3/1/2021  
 X       China           D    

In this e.g., the 5th row was removed (as the combination of US concatenated with Group X appears more than once, and someDate is null). The last row is retained because China (Group X) only appears once in the entire dataframe though someDate is null.

Would appreciate any help on this, thanks!

CodePudding user response:

You can use drop_duplicates

Check my answer :

import numpy as np
import pandas as pd

data = {
    'Group':['X', 'X', 'Y','Y','Y','X'],
    'ImportedFrom':['Switzerland', 'US', 'US', 'US', 'US', 'China'],
    'Item':['A', 'A' , 'B','C','C','D'],
    'SomeDate':['1/4/2021', '2/1/2021' , '2/1/2021','3/1/2021', '',''],
    }
# Create the data frame
df = pd.DataFrame(data)

df1 = df.drop_duplicates(subset=['ImportedFrom','Item'])

print(df1)

Output :

  Group ImportedFrom Item  SomeDate
0     X  Switzerland    A  1/4/2021
1     X           US    A  2/1/2021
2     Y           US    B  2/1/2021
3     Y           US    C  3/1/2021
5     X        China    D    

CodePudding user response:

# Normalize nulls
df = df.replace({None: np.nan, '': np.nan})

i = df.set_index(['Group', 'ImportedFrom', 'Item']).index
df = df[~(i.duplicated() & df['SomeDate'].isna())]

Output:

>>> df
  Group ImportedFrom Item  SomeDate
0     X  Switzerland    A  1/4/2021
1     X           US    A  2/1/2021
2     Y           US    B  2/2/2021
3     Y           US    C  3/1/2021
5     X        China    D       NaN
  • Related