Filter Rows Based on Highest and Lowest Column Values-CodePudding

I have a unique dataframe:

df = pd.DataFrame({'student': 'A B C D'.split(),
                  'score1':[20, 15, 30, 22],
                   'score2': [15, 22, 35, 18],
                   'score3': [24, 32, 38, 25],
                   'score4': [20, 20, 26, 30]})

print(df)

  student  score1  score2  score3  score4
0       A      20      15      24      20
1       B      15      22      32      20
2       C      30      35      38      26
3       D      22      18      25      30

I need to keep only those rows where the highest score is increased by more than 10 from the lowest score, otherwise drop them.

For example, for the student A, the lowest score is 15 and after that the score is increased to 24 (by 9), so we're gonna drop that.

For the student B, the lowest score is 15 and the score is increased to 32, so we're gonna keep that.

For the student C, the lowest score is 26, but there no score is increased after that. It is basically decreased, so we're gonna drop that.

I know that diff() and ge() would be helpful here, but not sure how they would work when the lowest score (has to be on the left of highest score) and highest score (has to be on the right side of lowest score) are many columns apart.

Desired output:

name

B #--highest score of 32 (score3) increased by 17 from lowest score of 15 (score1)  
D #--highest score of 30 (score4) increased by 12 from lowest score of 18 (score2)

Any suggestions would be appreciated. Thanks!

CodePudding user response：

You can start by sorting your dataframe along the columns, so that your score columns are in the right order (score1 -> score4) using sort_index. Then, you can get the min(1) score of each student and the corresponding column where the minimum occured, using idxmin(1) (same approach for the maximum):

# Sort Index
df.sort_index(axis=1,inplace=True) 
sc = df.filter(like='score').columns

# Max score with corresponding column
ma = pd.concat([df[sc].idxmax(1),df[sc].max(1)],axis=1)
mi = pd.concat([df[sc].idxmin(1),df[sc].min(1)],axis=1)

Lastly, you can use boolean indexing to compare the first column of max with the first column of min which will show whether the max score happened after the min score, and also compare whether the difference between those scores is greater than 10:

df.loc[(ma[0] > mi[0]) & (ma[1]-mi[1] > 10)]

Which will return:

   score1  score2  score3  score4 student
1      15      22      32      20       B
3      22      18      25      30       D

CodePudding user response：

Try:

select_student = lambda x: x.sub(x.cummin()).gt(10).any()
out = df[df.filter(like='score').apply(select_student, axis=1)]
print(out)

# Output:
  student  score1  score2  score3  score4
1       B      15      22      32      20
3       D      22      18      25      30

CodePudding user response：

Something like this would return the rows you are interested in:

df.set_index("student", inplace=True)
res = df[(df - df.cummin(axis=1) > 10).any(axis=1)]

This would simply return the lines where at least one improvement greater than 10 has been achieved. To express the output in the literal form you expect, I think it is a little bit more challenging and we should agree what we are tracking:

the largest improvement (that is my basic assumption). In case of a tie (2 improvements of same size), the improvement from the highest minimum is returned; in case of improvement with same high-low (student E in the example below), the first occurence is returned
the first improvement greater than 10?
the improvement after the global lowest?
the improvement to the global highest - considering the lowest point before the global highest.

Now, I have expressed the logic in the following code - it is verbose but I wanted to get the point through (the one liner I previously posted was a little monster). I added a line to the DataFrame as well:

import pandas as pd

df = pd.DataFrame({'student': 'A B C D E'.split(),
                  'score1':[20, 15, 30, 22, 5],
                   'score2': [15, 22, 35, 18, 15],
                   'score3': [24, 32, 38, 25, 5],
                   'score4': [20, 20, 26, 30, 15]})

df.set_index("student", inplace=True)

def print_improvement(votes, target_improvement = 10):
    improvements = votes - votes.cummin(axis = 1)
    votes = votes[(improvements >= target_improvement).any(axis=1)]
    for student in votes.index:
        improved_score_idx = improvements.loc[student].idxmax()
        score = votes.loc[student, improved_score_idx]
        improvement = improvements.loc[student, improved_score_idx]
        prev_min = score - improvement
        min_score_idx = votes.columns[votes.loc[student] == prev_min][0]
        print(f'{student} #--highest score of {score} ({improved_score_idx}) improved by {improvement} from lowest score of {prev_min} ({min_score_idx})')
   
print_improvement(df)

OUTPUT

B #--highest score of 32 (score3) improved by 17 from lowest score of 15 (score1)
D #--highest score of 30 (score4) improved by 12 from lowest score of 18 (score2)
E #--highest score of 15 (score2) improved by 10 from lowest score of 5 (score1)