I have a dataframe with two columns. I want to delete rows so that each value in the first column is represented by exactly one row, and no kept row repeats a value in the second column.
Here is an example:
import pandas as pd

data = [[1, 100],
        [1, 101],
        [1, 102],
        [1, 103],
        [2, 102],
        [2, 104],
        [2, 105],
        [3, 102],
        [3, 107]]
df = pd.DataFrame(data, columns=['x', 'y'])
The data frame looks like this:
x y
0 1 100
1 1 101
2 1 102
3 1 103
4 2 102
5 2 104
6 2 105
7 3 102
8 3 107
The output dataframe would look like this:
x y inc
0 1 100 1
1 1 101 0
2 1 102 0
3 1 103 0
4 2 102 1
5 2 104 0
6 2 105 0
7 3 102 0
8 3 107 1
So row 0 would be included (inc = 1), as 1 had not yet appeared in column x. Rows 1-3 would be excluded, as 1 in column x had already been accounted for. Row 4 would be included, as 2 in column x had not been included yet and its column y value (102) had not been included either (it was previously excluded as a duplicate). At row 7, the first instance of 3 in column x would be excluded because 102 (in column y) had already been accounted for in row 4. Therefore, we would skip to row 8 and include it.
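The rule described above amounts to a single greedy pass over the rows: keep a row only if neither its x value nor its y value has been used by an already-kept row. Purely as an illustration of that logic (this sketch is mine, not part of the question):

```python
import pandas as pd

data = [[1, 100], [1, 101], [1, 102], [1, 103],
        [2, 102], [2, 104], [2, 105],
        [3, 102], [3, 107]]
df = pd.DataFrame(data, columns=['x', 'y'])

seen_x, seen_y = set(), set()
inc = []
for x, y in zip(df['x'], df['y']):
    # keep the row only if neither value has been used by a kept row yet
    if x not in seen_x and y not in seen_y:
        seen_x.add(x)
        seen_y.add(y)
        inc.append(1)
    else:
        inc.append(0)
df['inc'] = inc
```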
I have tried a variety of .duplicated approaches, but none of them have worked so far. If you only take the first instance of each value in column x, you keep rows that should be excluded (for example row 7) and exclude rows that should be included (for example row 8).
Any help would be appreciated.
CodePudding user response:
I believe you would just use drop_duplicates with a subset of 'x' and specify keep='first' for which rows are kept:
df.drop_duplicates(keep='first', subset='x')
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
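For reference, here is that call applied to the example frame from the question (the variable name `out` is mine):

```python
import pandas as pd

df = pd.DataFrame([[1, 100], [1, 101], [1, 102], [1, 103],
                   [2, 102], [2, 104], [2, 105],
                   [3, 102], [3, 107]], columns=['x', 'y'])

# keep only the first row for each distinct value of 'x'
out = df.drop_duplicates(keep='first', subset='x')
```

Note that this keeps row 7 (3, 102) rather than row 8, since it deduplicates on 'x' only; as the asker points out, duplicates in 'y' are not taken into account.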
CodePudding user response:
One way is to use a set and a custom function:
seen = set()

def func(d):
    # keep only the y values not already claimed by an earlier group
    res = d[~d.isin(seen)]
    if len(res):
        cur = res.iat[0]
        seen.add(cur)
        return cur

print(df.groupby("x")["y"].apply(func))
x
1 100
2 102
3 107
Name: y, dtype: int64
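If you need the inc column from the question rather than just the picked values, one possible extension (the `picked` name and the map-based comparison are my additions, not part of the answer above) is to map each group's picked value back onto the rows:

```python
import pandas as pd

df = pd.DataFrame([[1, 100], [1, 101], [1, 102], [1, 103],
                   [2, 102], [2, 104], [2, 105],
                   [3, 102], [3, 107]], columns=['x', 'y'])

seen = set()

def func(d):
    # keep only the y values not already claimed by an earlier group
    res = d[~d.isin(seen)]
    if len(res):
        cur = res.iat[0]
        seen.add(cur)
        return cur

picked = df.groupby("x")["y"].apply(func)  # Series indexed by x
# a row is included iff its y equals the value picked for its x
df["inc"] = (df["y"] == df["x"].map(picked)).astype(int)
```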