I would like to filter duplicate rows in a DataFrame according to the columns "NID", "Lact", and "Code", but only when the column "Code" = 10.
The following dictionary provides example data:
import pandas as pd

data_list = {'NID': {1: '1', 2: '1', 3: '1', 4: '1', 5: '2', 6: '2', 7: '1'},
'Lact': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 1},
'Code': {1: 10, 2: 0, 3: 10, 4: 0, 5: 0, 6: 10, 7: 0}}
df = pd.DataFrame(data_list)
The resulting DataFrame appears below:
NID Lact Code
1 1 1 10
2 1 1 0
3 1 1 10
4 1 2 0
5 2 2 0
6 2 2 10
7 1 1 0
If I run the following filter, it identifies duplicate rows based on "NID", "Lact", and "Code":
df[(df.duplicated(['NID', 'Lact', 'Code'], keep=False))]
The output is provided below
NID Lact Code
1 1 1 10
2 1 1 0
3 1 1 10
7 1 1 0
I would like to make this filter conditional on Code == 10, as I would like to delete the first instance of duplicate rows when Code = 10, but not when Code is not equal to 10. Is there a way to add a condition for Code == 10 to this filter?
CodePudding user response:
If you need to remove the first duplicated row when the condition Code == 10 holds, chain the mask with DataFrame.duplicated and its default keep='first' parameter; if you also need to keep only the duplicated rows, chain m2 with & (bitwise AND):
m1 = df['Code'].eq(10)                                   # rows where Code == 10
m2 = df.duplicated(['NID', 'Lact', 'Code'], keep=False)  # all rows that belong to a duplicate group
m3 = df.duplicated(['NID', 'Lact', 'Code'])              # duplicates except the first occurrence
df = df[(~m1 | m3) & m2]
print(df)
NID Lact Code
2 1 1 0
3 1 1 10
7 1 1 0
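Putting the three masks together end-to-end, a self-contained sketch (assuming the data_list dictionary from the question) looks like this:

```python
import pandas as pd

# Sample data from the question
data_list = {'NID': {1: '1', 2: '1', 3: '1', 4: '1', 5: '2', 6: '2', 7: '1'},
             'Lact': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 1},
             'Code': {1: 10, 2: 0, 3: 10, 4: 0, 5: 0, 6: 10, 7: 0}}
df = pd.DataFrame(data_list)

m1 = df['Code'].eq(10)                                   # Code is 10
m2 = df.duplicated(['NID', 'Lact', 'Code'], keep=False)  # every member of a duplicate group
m3 = df.duplicated(['NID', 'Lact', 'Code'])              # duplicates except the first occurrence

# Keep duplicate rows, but drop the first occurrence of groups where Code == 10
out = df[(~m1 | m3) & m2]
```

The result keeps rows 2, 3, and 7: the first Code == 10 duplicate (row 1) is dropped, while the Code == 0 duplicates are kept in full.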
CodePudding user response:
IIUC, you want to keep all rows if Code is not equal to 10 but drop duplicates otherwise, right? Then you could add that into the boolean mask:
out = df[~df.duplicated(['NID', 'Lact', 'Code'], keep=False) | df['Code'].ne(10)]
Output:
NID Lact Code
2 1 1 0
4 1 2 0
5 2 2 0
6 2 2 10
7 1 1 0
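Run end-to-end on the question's sample data, a self-contained sketch of this approach (assuming the same data_list dictionary) is:

```python
import pandas as pd

data_list = {'NID': {1: '1', 2: '1', 3: '1', 4: '1', 5: '2', 6: '2', 7: '1'},
             'Lact': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 1},
             'Code': {1: 10, 2: 0, 3: 10, 4: 0, 5: 0, 6: 10, 7: 0}}
df = pd.DataFrame(data_list)

# Keep a row if it is not part of any duplicate group, or if its Code is not 10
out = df[~df.duplicated(['NID', 'Lact', 'Code'], keep=False) | df['Code'].ne(10)]
```

Note the difference from the first answer: here both Code == 10 duplicates (rows 1 and 3) are dropped, and all non-duplicate rows are kept.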