This is my data frame with 5 rows and 3 columns:
df = pd.DataFrame({'A': [1,4,4,3,7], 'B': [1,2,2,6,4], 'C': [1,2,2,6,4]})
I have to find a way to drop a row if the datapoint in column A is finding a value higher than itself in column A, and the B value of that row is lower than the B value of the querying row.
For example, in the above data frame row 4 has to be dropped because it has higher values in column A (4, 7) with less B value (2,4).
I will modify the question in a application perspective for better clarity. Sorry for my bad presentation skills.
Lets say this is our dataframe
df = pd.DataFrame({'resources': [100,200,300,300,400,400,400,500,1000],
'score': [1,2,1,2,3,5,6,8,9]})
I want to find a trade-off with resources i use and my score. My priority is to get the best score with less resources. I iterate all combinations and see if a row is eligible to be considered. So basically in this 9 rows, rows 3, 4, 5,6 should be eliminated 3 because 1 gives the same score with less resource, 4 because 2 gives the same score with less resource, 5 and 6 because 7 gives a better score with same resource. I hope this will make my problem more clear.
CodePudding user response:
Starting from:
A B C
0 1 1 1
1 4 2 2
2 4 2 2
3 3 6 6
4 7 4 4
You can do:
mask_a = df.A.apply(lambda x: (df.A > x).any())
mask_b = df.B.apply(lambda x: (df.B <= x).all())
df[~(mask_a & mask_b)]
# Output
A B C
0 1 1 1
1 4 2 2
2 4 2 2
4 7 4 4
CodePudding user response:
Doesn't grouping by the score and taking the minimum resource solves your problem?
min_resources = df.groupby('score').resources.min().reset_index()
max_scores = min_resources.groupby('resources').score.max().reset_index()
And the result is:
resources score
0 100 1
1 200 2
2 400 6
3 500 8
4 1000 9
CodePudding user response:
IIUC, you want to get the max score per resource and the min resource per score simultaneously.
You can compute both and get the intersection:
import numpy as np
idx1 = df.groupby('resources')['score'].idxmax().values
# array([0, 1, 3, 6, 7, 8])
idx2 = df.groupby('score')['resources'].idxmin().values
# array([0, 1, 4, 5, 6, 7, 8])
out = df.loc[np.intersect1d(idx1, idx2)]
output:
resources score
0 100 1
1 200 2
6 400 6
7 500 8
8 1000 9
used input:
df = pd.DataFrame({'resources': [100,200,300,300,400,400,400,500,1000],
'score': [1,2,1,2,3,5,6,8,9]})