Remove duplicates that are in included in two columns in pandas-CodePudding

I have a dataframe that has two columns. I want to delete rows such that, for each row, it includes only one instance in the first column, but all unique values in column two are included.

Here is an example:

data = [[1,100],
      [1,101],
      [1,102],
      [1,103],
      [2,102],
      [2,104],
      [2,105],
      [3,102],
      [3,107]]

df = pd.DataFrame(data,columns = ['x', 'y'])

The data frame looks like this:

The output dataframe would look like this:

  x    y  inc
0  1  100    1
1  1  101    0
2  1  102    0
3  1  103    0
4  2  102    1
5  2  104    0
6  2  105    0
7  3  102    0
8  3  107    1

so row 0 would be included (inc), as 1 had not been duplicated yet in column x. Rows 1-3 would be excluded, as 1 in column x had already been accounted for. Row 4 would be included, as 2 in column x had not been included yet and column y (102) had not been included (it was excluded as a duplicate). At row 7, the first instance of 3 in column x would be excluded because 102 (in column y) had already been account for in row 4. Therefore, we would skip to row 8 and include it.

I have tried a variety of .duplicated approaches, but none of them have worked so far. If you only take the first instance of a value in column x, you would exclude rows that should be included (for example row 7).

Any help would be appreciated.

CodePudding user response：

I believe you would just use drop_duplicates with a subset of 'x' and specify 'first' as to which rows will be kept.

df.drop_duplicates(keep='first', subset = 'x')

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

CodePudding user response：

One way is to use a set and create custom function:

seen = set()

def func(d):
    res = d[~d.isin(seen)]
    if len(res):
        cur = res.iat[0]
        seen.add(cur)
        return cur

print (df.groupby("x")["y"].apply(func))

x
1    100
2    102
3    107
Name: y, dtype: int64