So I have this type of dataset while working in python with pandas:
id pos result
0 1 1 AB
1 1 1 --
2 1 1 BC
3 1 1 AB
4 1 2 CA
5 2 1 CA
6 2 2 --
7 2 2 BA
8 3 1 --
9 3 1 --
Desired result:
id pos result
0 1 1 AB
2 1 1 BC
3 1 1 AB
4 1 2 CA
5 2 1 CA
7 2 2 BA
8 3 1 --
9 3 1 --
I would like to drop duplicate rows on [id] AND [pos] where the result are '--', but only if there already exist a result on the same [id] AND [pos] that has a A, B or C combination.
If no A, B or C combination exist on the same [id] AND [pos], then I want to keep the result '--'.
I know how to drop duplicate rows, but this problem is really beyond my knowledge and I am hoping for your help.
CodePudding user response:
You can use masks and boolean indexing:
# is the result not a "--"?
m = df['result'].ne('--')
# is there at least a non "--" in the group?
m2 = (m
.groupby([df['id'], df['pos']])
.transform('max')
)
# keep if both conditions are equal
out = df[m==m2]
Alternative for the last step:
# keep if not "--" or not a non "--" in group
out = df[m|~m2]
Output:
id pos result
0 1 1 AB
2 1 1 BC
3 1 1 AB
4 1 2 CA
5 2 1 CA
7 2 2 BA
8 3 1 --
9 3 1 --