i.e i want to drop duplicates pairs using col1 and col2 as the subset only if the values are the opposite in col3 (one negative and one positive). similar to drop_duplicates function but i want to impose a condition.
my dataset (df):
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
4 1 2 -1
5 1 2 1
6 1 2 1
I want:
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
6 1 2 1
rows 4 and 5 are duplicated in col1 and col2 but and value in col3 is the opposite, therefore we remove both. row 0 and row 2 have duplicate values in col1 and col2 but col3 is the same, so we don't remove those rows.
i've tried using drop_duplicates but realised it wouldn't work as it will only remove duplicates and not consider anything else.
CodePudding user response:
We can do transform
out = df[df.groupby(['col1','col2']).col3.transform('sum').ne(0) & df.col3.ne(0)]
Out[252]:
col1 col2 col3
0 1 1 1
1 2 2 2
2 1 1 1
3 3 5 7
CodePudding user response:
Recreating the dataset:
import pandas as pd
data = [
[1, 1, 1],
[2, 2, 2],
[1, 1, 1],
[3, 5, 7],
[1, 2, -1],
[1, 2, 1],
[1, 2, 1],
]
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
if your data is not massive, you can use an iterrows
function on a subset of the data.
The subset contains all duplicate values after all values have been turned into absolute values.
Next, we check if col3
is negative and if the opposite of col3
is in the duplicate subset.
If so, we drop the row from df
.
df_dupes = df[df.abs().duplicated(keep=False)]
df_dupes_list = df_dupes.to_numpy().tolist()
for i, row in df_dupes.iterrows():
if row.col3 < 0 and [row.col1, row.col2, -row.col3] in df_dupes_list:
df.drop(labels=i, axis=0, inplace=True)
This code should remove row 4.
In your desired output, you left row 5 for some reason.
If you can explain why you left row 5 but kept row 0, then I can adjust my code to more accurately match your desired output.