Eliminate rows in a dataframe after looping through all rows of column and applying condition-CodePudding

This is my data frame with 5 rows and 3 columns:

df = pd.DataFrame({'A': [1,4,4,3,7], 'B': [1,2,2,6,4], 'C': [1,2,2,6,4]})

I have to find a way to drop a row if the datapoint in column A is finding a value higher than itself in column A, and the B value of that row is lower than the B value of the querying row.

For example, in the above data frame row 4 has to be dropped because it has higher values in column A (4, 7) with less B value (2,4).

I will modify the question in a application perspective for better clarity. Sorry for my bad presentation skills.

Lets say this is our dataframe

df = pd.DataFrame({'resources': [100,200,300,300,400,400,400,500,1000],
                   'score': [1,2,1,2,3,5,6,8,9]})

I want to find a trade-off with resources i use and my score. My priority is to get the best score with less resources. I iterate all combinations and see if a row is eligible to be considered. So basically in this 9 rows, rows 3, 4, 5,6 should be eliminated 3 because 1 gives the same score with less resource, 4 because 2 gives the same score with less resource, 5 and 6 because 7 gives a better score with same resource. I hope this will make my problem more clear.

CodePudding user response：

Starting from:

    A   B   C
0   1   1   1
1   4   2   2
2   4   2   2
3   3   6   6
4   7   4   4

You can do:

mask_a = df.A.apply(lambda x: (df.A > x).any())
mask_b = df.B.apply(lambda x: (df.B <= x).all())

df[~(mask_a & mask_b)]

# Output
    A   B   C
0   1   1   1
1   4   2   2
2   4   2   2
4   7   4   4

CodePudding user response：

Doesn't grouping by the score and taking the minimum resource solves your problem?

min_resources = df.groupby('score').resources.min().reset_index()
max_scores = min_resources.groupby('resources').score.max().reset_index()

And the result is:

    resources   score
0   100         1
1   200         2
2   400         6
3   500         8
4   1000        9

CodePudding user response：

IIUC, you want to get the max score per resource and the min resource per score simultaneously.

You can compute both and get the intersection:

import numpy as np

idx1 = df.groupby('resources')['score'].idxmax().values
# array([0, 1, 3, 6, 7, 8])

idx2 = df.groupby('score')['resources'].idxmin().values
# array([0, 1, 4, 5, 6, 7, 8])

out = df.loc[np.intersect1d(idx1, idx2)]

output:

   resources  score
0        100      1
1        200      2
6        400      6
7        500      8
8       1000      9

used input:

df = pd.DataFrame({'resources': [100,200,300,300,400,400,400,500,1000],
                   'score': [1,2,1,2,3,5,6,8,9]})