I have a pandas data frame that looks like this:
Group ImportedFrom Item SomeDate
X Switzerland A 1/4/2021
X US A 2/1/2021
Y US B 2/2/2021
Y US C 3/1/2021
Y US C
X China D
I need to only remove rows where someDate is null/empty, only if the unique combination of "ImpotedFrom" & "Group" appears at least once in the dataframe.
The final result should look something like this:
Group ImportedFrom Item SomeDate
X Switzerland A 1/4/2021
X US A 2/1/2021
Y US B 2/2/2021
Y US C 3/1/2021
X China D
In this e.g., the 5th row was removed (as the combination of US concatenated with Group X appears more than once, and someDate is null). The last row is retained because China (Group X) only appears once in the entire dataframe though someDate is null.
Would appreciate any help on this, thanks!
CodePudding user response:
You can use drop_duplicates
Check my answer :
import numpy as np
import pandas as pd
data = {
'Group':['X', 'X', 'Y','Y','Y','X'],
'ImportedFrom':['Switzerland', 'US', 'US', 'US', 'US', 'China'],
'Item':['A', 'A' , 'B','C','C','D'],
'SomeDate':['1/4/2021', '2/1/2021' , '2/1/2021','3/1/2021', '',''],
}
# Create the data frame
df = pd.DataFrame(data)
df1 = df.drop_duplicates(subset=['ImportedFrom','Item'])
print(df1)
Output :
Group ImportedFrom Item SomeDate
0 X Switzerland A 1/4/2021
1 X US A 2/1/2021
2 Y US B 2/1/2021
3 Y US C 3/1/2021
5 X China D
CodePudding user response:
# Normalize nulls
df = df.replace({None: np.nan, '': np.nan})
i = df.set_index(['Group', 'ImportedFrom', 'Item']).index
df = df[~(i.duplicated() & df['SomeDate'].isna())]
Output:
>>> df
Group ImportedFrom Item SomeDate
0 X Switzerland A 1/4/2021
1 X US A 2/1/2021
2 Y US B 2/2/2021
3 Y US C 3/1/2021
5 X China D NaN