Filter Dataframe Based on Highest and Lowest Row Values with Increasing Timeline


I have the following dataframe of students with their exam scores on different dates (sorted):

import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A A B B B C C D D'.split(),
                   'exam_date': [datetime.datetime(2013, 4, 1), datetime.datetime(2013, 6, 1),
                                 datetime.datetime(2013, 7, 1), datetime.datetime(2013, 9, 2),
                                 datetime.datetime(2013, 10, 1), datetime.datetime(2013, 11, 2),
                                 datetime.datetime(2014, 2, 2), datetime.datetime(2014, 5, 2),
                                 datetime.datetime(2014, 6, 2), datetime.datetime(2013, 7, 1),
                                 datetime.datetime(2013, 9, 2)],
                   'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40]})

print(df)

   student  exam_date  score
0        A 2013-04-01     15
1        A 2013-06-01     22
2        A 2013-07-01     32
3        A 2013-09-02     20
4        B 2013-10-01     30
5        B 2013-11-02     38
6        B 2014-02-02     26
7        C 2014-05-02     18
8        C 2014-06-02     30
9        D 2013-07-01     33
10       D 2013-09-02     40

I need to keep only the rows of students whose highest score exceeds their lowest score by more than 10; otherwise their rows are dropped. The date also matters: the highest score has to occur on a later date than the lowest score.

For example, for student A, the lowest score is 15 and the score later rises to 32 (on a later date), so we keep that student.

For student B, the lowest score is 26, but no score increases after it; the scores only decrease, so we drop that student.

For student D, the lowest score is 33 and the score rises to 40, an increase of only 7, so we drop that student.

I first tried df.groupby('student').agg({'score': np.ptp}), but it was hard to track whether the score increased or decreased.

Then I tried df.loc[df.groupby('student')['score'].idxmin()] and df.loc[df.groupby('student')['score'].idxmax()] to get the min and max rows, but I'm not sure how I would compare the dates. Maybe merge them and then compare, but that seems like too much work.
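For what it's worth, the merge-and-compare idea is not much code. A minimal sketch, assuming the dataframe from above and using hypothetical names `lo`, `hi`, and `m` for the intermediate frames:

```python
import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A A B B B C C D D'.split(),
                   'exam_date': [datetime.datetime(2013, 4, 1), datetime.datetime(2013, 6, 1),
                                 datetime.datetime(2013, 7, 1), datetime.datetime(2013, 9, 2),
                                 datetime.datetime(2013, 10, 1), datetime.datetime(2013, 11, 2),
                                 datetime.datetime(2014, 2, 2), datetime.datetime(2014, 5, 2),
                                 datetime.datetime(2014, 6, 2), datetime.datetime(2013, 7, 1),
                                 datetime.datetime(2013, 9, 2)],
                   'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40]})

# Rows holding each student's lowest and highest score
lo = df.loc[df.groupby('student')['score'].idxmin()]
hi = df.loc[df.groupby('student')['score'].idxmax()]

# Merge on student so the min and max rows sit side by side
m = lo.merge(hi, on='student', suffixes=('_min', '_max'))

# Keep students whose max came on a later date AND rose by more than 10
ok = m[(m['exam_date_max'] > m['exam_date_min'])
       & (m['score_max'] - m['score_min'] > 10)]

# Report the max-score row of each qualifying student
out = df.loc[df.groupby('student')['score'].idxmax()]
out = out[out['student'].isin(ok['student'])]
print(out)
```

This follows the "global max after global min" reading of the requirement described above, which happens to give the desired rows 2 and 8 on the sample data.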

Desired output:

student exam_date   score
2   A   2013-07-01  32
8   C   2014-06-02  30

#--For A, highest score of 32 increased by 17 from lowest score of 15  
#--For C, highest score of 30 increased by 12 from lowest score of 18 

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response:

Assuming your dataframe is already sorted by date:

highest_score = lambda x: x['score'].cummax() * (x['score'] > x['score'].shift()) \
                          - (x['score'].cummin()) >= 10

out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)

# Output:
  student  exam_date  score
2       A 2013-07-01     32
8       C 2014-06-02     30

The factor `(x['score'] > x['score'].shift())` zeroes out the cumulative max on rows where the score did not increase from the previous row, so a max reached before a drop is not carried forward.
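To see the mask at work, here is a small standalone trace on student A's scores (a hypothetical four-element series, not the full dataframe):

```python
import pandas as pd

s = pd.Series([15, 22, 32, 20])   # student A's scores in date order

raw_cummax = s.cummax()           # 15, 22, 32, 32
mask = s > s.shift()              # False, True, True, False
masked = raw_cummax * mask        # 0, 22, 32, 0
gain = masked - s.cummin()        # -15, 7, 17, -15
print(gain >= 10)                 # True only at the row holding 32
```

The final drop to 20 leaves `mask` False on that row, so the stale max of 32 is zeroed out rather than propagated, and only the genuine new high at index 2 passes the threshold.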

CodePudding user response:

This question is somewhat confusing, but this works for your sample data:

subset = df.loc[
    df.groupby('student')
      .apply(lambda x: x['score'].idxmax()
             if x.sort_values('exam_date')['score'].diff().max() >= 10
             else None)
      .dropna()
      .astype(int)
]

Output:

>>> subset
  student  exam_date  score
2       A 2013-07-01     32
8       C 2014-06-02     30

Note that diff() looks at consecutive scores, so this keeps a student whenever any single exam-to-exam jump is at least 10, which is a slightly different reading of the requirement.

CodePudding user response:

So in your case, first filter out everything before each student's minimum-score row, then keep the rows that exceed that minimum by more than 10:

df = df[df.index >= df.groupby('student')['score'].transform('idxmin')]
out = df[df['score'] - df.groupby('student')['score'].transform('min') > 10]
print(out)

# Output:
  student  exam_date  score
2       A 2013-07-01     32
8       C 2014-06-02     30