I have the following dataframe of students with their exam scores on different dates (sorted):
import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A A B B B C C D D'.split(),
'exam_date':[datetime.datetime(2013,4,1),datetime.datetime(2013,6,1),
datetime.datetime(2013,7,1),datetime.datetime(2013,9,2),
datetime.datetime(2013,10,1),datetime.datetime(2013,11,2),
datetime.datetime(2014,2,2),datetime.datetime(2014,5,2),
datetime.datetime(2014,6,2), datetime.datetime(2013,7,1),
datetime.datetime(2013,9,2),],
'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40]})
print(df)
student exam_date score
0 A 2013-04-01 15
1 A 2013-06-01 22
2 A 2013-07-01 32
3 A 2013-09-02 20
4 B 2013-10-01 30
5 B 2013-11-02 38
6 B 2014-02-02 26
7 C 2014-05-02 18
8 C 2014-06-02 30
9 D 2013-07-01 33
10 D 2013-09-02 40
I need to keep only the rows where the highest score is more than 10 above the lowest score, and drop everything else. The dates matter too: the highest score has to come on a later date than the lowest score.
For example, for student A, the lowest score is 15 and the score later increases to 32 (on a later date), so we keep that row.
For student B, the lowest score is 26, but no later score is higher than that. The score basically decreased, so we drop B.
For student D, the lowest score is 33 and the score increases to 40, an increase of only 7, so we drop D.
I first tried df.groupby('student').agg({'score': np.ptp}), but it was tough to track whether the score had decreased or increased.
Then I tried df.loc[df.groupby('student')['score'].idxmin()] and df.loc[df.groupby('student')['score'].idxmax()] to get the min and max values, but I'm not sure how to compare the dates. Maybe I could merge them and then compare, but that seems like too much work.
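For reference, the merge-and-compare idea can be sketched roughly as below. This is only a sketch: the names lo, hi and keep are made up for illustration, and it compares just the single lowest and single highest row per student. That is enough for the sample data, but it may miss a qualifying later score when a student's overall maximum comes before the minimum.

```python
import pandas as pd

df = pd.DataFrame({
    'student': 'A A A A B B B C C D D'.split(),
    'exam_date': pd.to_datetime([
        '2013-04-01', '2013-06-01', '2013-07-01', '2013-09-02',
        '2013-10-01', '2013-11-02', '2014-02-02',
        '2014-05-02', '2014-06-02', '2013-07-01', '2013-09-02']),
    'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40],
})

grouped = df.groupby('student')['score']

# One row per student: the row holding the lowest score.
lo = df.loc[grouped.idxmin()].set_index('student')

# One row per student: the row holding the highest score.
# reset_index() keeps the original row label in a column named 'index'.
hi = df.loc[grouped.idxmax()].reset_index().set_index('student')

# Keep a student only if the max comes after the min and is more than 10 higher.
keep = (hi['exam_date'] > lo['exam_date']) & (hi['score'] - lo['score'] > 10)

result = df.loc[hi.loc[keep, 'index']]
print(result)
```

Because lo and hi are both indexed by student, the date and score comparisons line up per student without an explicit merge.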
Desired output:
student exam_date score
2 A 2013-07-01 32
8 C 2014-06-02 30
#--For A, highest score of 32 increased by 17 from lowest score of 15
#--For C, highest score of 30 increased by 12 from lowest score of 18
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
CodePudding user response:
Assuming your dataframe is already sorted by date:
highest_score = lambda x: x['score'].cummax() * (x['score'] > x['score'].shift()) \
- (x['score'].cummin()) >= 10
out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)
# Output:
student exam_date score
2 A 2013-07-01 32
8 C 2014-06-02 30
The expression * (x['score'] > x['score'].shift()) prevents cummax from being propagated when the next value is lower than the current max.
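To see what each piece of that lambda contributes, here is a rough step-by-step version that writes the intermediate values into columns. It is a sketch under the same assumption (rows already sorted by date within each student); the column names cummax, cummin, rose, masked_max and keep are invented for illustration. It keeps the answer's >= 10 comparison, although the question asks for "more than 10", which would be > 10.

```python
import pandas as pd

df = pd.DataFrame({
    'student': 'A A A A B B B C C D D'.split(),
    'exam_date': pd.to_datetime([
        '2013-04-01', '2013-06-01', '2013-07-01', '2013-09-02',
        '2013-10-01', '2013-11-02', '2014-02-02',
        '2014-05-02', '2014-06-02', '2013-07-01', '2013-09-02']),
    'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40],
})

g = df.groupby('student')['score']

steps = pd.DataFrame({
    'student': df['student'],
    'score': df['score'],
    'cummax': g.cummax(),              # running maximum per student
    'cummin': g.cummin(),              # running minimum per student
    'rose': df['score'] > g.shift(),   # True only where the score rose from the previous exam
})

# Multiplying by the boolean mask zeroes the running max on rows that did not rise,
# so a stale maximum is not carried onto a later, lower score.
steps['masked_max'] = steps['cummax'] * steps['rose']
steps['keep'] = steps['masked_max'] - steps['cummin'] >= 10

out = df[steps['keep']]
print(steps)
print(out)
```

The first exam of each student compares against NaN, so rose is False there and the row can never be kept.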
CodePudding user response:
This question is somewhat confusing, but this works for your sample data:
subset = df.loc[df.groupby('student').apply(lambda x: x['score'].idxmax() if x.sort_values('exam_date')['score'].diff().max() >= 10 else None).dropna().astype(int)]
Output:
>>> subset
student exam_date score
2 A 2013-07-01 32
CodePudding user response:
In your case, first keep only the rows from each student's minimum score onward, then compare the remaining scores against that minimum:
df = df[df.index>=df.groupby('student')['score'].transform('idxmin')]
out = df[df.score - df.groupby('student').score.transform('min')>10]
Out[77]:
student exam_date score
2 A 2013-07-01 32
8 C 2014-06-02 30
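Note that the first line above overwrites df. If the original frame is still needed, the same two steps can be wrapped in a function that leaves it untouched. This is a sketch only: rows_after_min_rise and its threshold parameter are hypothetical names, and it assumes, like the code above, that the index is the default integer index and the rows are in date order within each student.

```python
import pandas as pd

df = pd.DataFrame({
    'student': 'A A A A B B B C C D D'.split(),
    'exam_date': pd.to_datetime([
        '2013-04-01', '2013-06-01', '2013-07-01', '2013-09-02',
        '2013-10-01', '2013-11-02', '2014-02-02',
        '2014-05-02', '2014-06-02', '2013-07-01', '2013-09-02']),
    'score': [15, 22, 32, 20, 30, 38, 26, 18, 30, 33, 40],
})


def rows_after_min_rise(frame, threshold=10):
    """Return rows at or after each student's minimum whose score exceeds it by more than threshold."""
    scores = frame.groupby('student')['score']
    # Position check: the row must not come before the student's lowest score.
    after_min = frame.index.to_numpy() >= scores.transform('idxmin').to_numpy()
    # Size check: the rise over the student's minimum must be strictly above the threshold.
    rise = (frame['score'] - scores.transform('min')).to_numpy()
    return frame[after_min & (rise > threshold)]


out = rows_after_min_rise(df)
print(out)
```

In general this can return more than one row per student, since every row after the minimum that clears the threshold passes both checks.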