I have a pandas dataframe with information about students and test dates. I would like to create a variable that takes on a new value for each student, but also takes on a new value for the same student if 5 years have passed without a test attempt. The desired column is "group" below. How can I do this in python?
Student test_date group
Bob 1995 1
Bob 1997 1
Bob 2020 2
Bob 2020 2
Mary 2020 3
Mary 2021 3
Mary 2021 3
The initial, very clunky idea I had was to sort by name, sort by date, calculate the difference in date, have an ind if diff > 5, and then somehow number by groups.
ds = pd.read_excel('../students.xlsx')
ds = ds.sort_values(by=['student','test_date'])
ds['time'] = ds['test_date'].diff()
ds['break'] = 0
ds.loc[(ds['time'] > 5),'break'] = 1
Student test_date time break
Bob 1995 na na
Bob 1997 2 0
Bob 2020 23 1
Bob 2020 0 0
Mary 2020 na na
Mary 2021 1 0
Mary 2021 0 0
CodePudding user response:
df = df.sort_values(["Student", "test_date"])
((df.Student != df.Student.shift()) | (df.test_date.diff().gt(5))).cumsum()
# 0 1
# 1 1
# 2 2
# 3 2
# 4 3
# 5 3
# 6 3
# dtype: int32