I would like to filter duplicate rows in a DataFrame according to the columns "NID", "Lact", and "Code", but only when the column "Code" = 10.
The following dictionary provides example data:
import pandas as pd

data_list = {'NID': {1: '1', 2: '1', 3: '1', 4: '1', 5: '2', 6: '2', 7: '1'},
'Lact': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 1},
'Code': {1: 10, 2: 0, 3: 10, 4: 0, 5: 0, 6: 10, 7: 0}}
df = pd.DataFrame(data_list)
The resulting DataFrame appears below:
NID Lact Code
1 1 1 10
2 1 1 0
3 1 1 10
4 1 2 0
5 2 2 0
6 2 2 10
7 1 1 0
If I run the following filter, it identifies duplicate rows based on "NID", "Lact", and "Code":
df[(df.duplicated(['NID', 'Lact', 'Code'], keep=False))]
The output is provided below
NID Lact Code
1 1 1 10
2 1 1 0
3 1 1 10
7 1 1 0
I would like to make this filter conditional on Code == 10, as I would like to delete the first instance of duplicate rows when Code = 10, but not when Code is not equal to 10. Is there a way to add a condition for Code == 10 to this filter?
CodePudding user response:
If you need to remove the first duplicated row when the condition Code == 10 holds, chain the mask with DataFrame.duplicated and its default keep='first' parameter; if you also need to keep only the duplicated rows, chain m2 with & (bitwise AND):
m1 = df['Code'].eq(10)                                   # rows where Code == 10
m2 = df.duplicated(['NID', 'Lact', 'Code'], keep=False)  # all rows that belong to a duplicate group
m3 = df.duplicated(['NID', 'Lact', 'Code'])              # duplicates except the first occurrence
df = df[(~m1 | m3) & m2]
print(df)
NID Lact Code
2 1 1 0
3 1 1 10
7 1 1 0
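Putting the three masks together end-to-end, a self-contained sketch (assuming the data_list dictionary from the question) looks like this:

```python
import pandas as pd

# Sample data from the question
data_list = {'NID': {1: '1', 2: '1', 3: '1', 4: '1', 5: '2', 6: '2', 7: '1'},
             'Lact': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 1},
             'Code': {1: 10, 2: 0, 3: 10, 4: 0, 5: 0, 6: 10, 7: 0}}
df = pd.DataFrame(data_list)

m1 = df['Code'].eq(10)                                   # Code is 10
m2 = df.duplicated(['NID', 'Lact', 'Code'], keep=False)  # every member of a duplicate group
m3 = df.duplicated(['NID', 'Lact', 'Code'])              # duplicates except the first occurrence

# Keep duplicate rows, but drop the first occurrence of groups where Code == 10
out = df[(~m1 | m3) & m2]
```

The result keeps rows 2, 3, and 7: the first Code == 10 duplicate (row 1) is dropped, while the Code == 0 duplicates are kept in full.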
CodePudding user response:
IIUC, you want to keep all rows if Code is not equal to 10 but drop duplicates otherwise, right? Then you could add that into the boolean mask:
out = df[~df.duplicated(['NID', 'Lact', 'Code'], keep=False) | df['Code'].ne(10)]
Output:
NID Lact Code
2 1 1 0
4 1 2 0
5 2 2 0
6 2 2 10
7 1 1 0
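Run end-to-end on the question's sample data, a self-contained sketch of this approach (assuming the same data_list dictionary) is:

```python
import pandas as pd

data_list = {'NID': {1: '1', 2: '1', 3: '1', 4: '1', 5: '2', 6: '2', 7: '1'},
             'Lact': {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 1},
             'Code': {1: 10, 2: 0, 3: 10, 4: 0, 5: 0, 6: 10, 7: 0}}
df = pd.DataFrame(data_list)

# Keep a row if it is not part of any duplicate group, or if its Code is not 10
out = df[~df.duplicated(['NID', 'Lact', 'Code'], keep=False) | df['Code'].ne(10)]
```

Note the difference from the first answer: here both Code == 10 duplicates (rows 1 and 3) are dropped, and all non-duplicate rows are kept.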