Cumulatively increment based on a threshold between a current date and that of the previous row in a-CodePudding

Be the following python pandas DataFrame:

| num_ID | start_date  | end_date   | other_column      |
| ------ | ----------- | ---------- | ----------------- |
| 1      | 2022-02-14  | 2022-02-15 | 09:23:00          |
| 1      | 2022-02-20  | 2022-02-25 | 12:10:01          |
| 2      | 2022-03-11  | 2022-03-21 | 08:21:00          |
| 2      | 2022-03-22  | 2022-03-27 | 02:36:00          |
| 2      | 2022-04-10  | 2022-04-15 | 11:43:03          |
| 3      | 2022-02-04  | 2022-02-06 | 16:51:00          |
| 3      | 2022-02-14  | 2022-02-23 | 19:35:10          |
| 3      | 2022-03-05  | 2022-03-06 | 21:05:13          |
| 4      | 2022-02-28  | 2022-10-12 | 00:01:00          |

For each num_ID value, I want to add a new column diff_weeks with a cumulative value if the end_date value of the previous row and the start_date value from the actual one are at least 7 days apart.

First add the value 0 for this column for the first row of each num_ID value, since they do not have a previous row.

| num_ID | start_date  | end_date   | other_column      | diff_weeks  |
| ------ | ----------- | ---------- | ----------------- | ----------- |
| 1      | 2022-02-14  | 2022-02-15 | 09:23:00          | 0           |
| 1      | 2022-02-20  | 2022-02-25 | 12:10:01          |             |
| 2      | 2022-03-11  | 2022-03-21 | 08:21:00          | 0           |
| 2      | 2022-03-22  | 2022-03-27 | 02:36:00          |             |
| 2      | 2022-04-10  | 2022-04-15 | 11:43:03          |             |
| 3      | 2022-02-04  | 2022-02-06 | 16:51:00          | 0           |
| 3      | 2022-02-14  | 2022-02-23 | 19:35:10          |             |
| 3      | 2022-03-05  | 2022-03-06 | 21:05:13          |             |
| 4      | 2022-02-28  | 2022-10-12 | 00:01:00          | 0           |

Finally we fill in the remaining rows with the condition, if the end_date value of the previous row (with the same num_ID) is at least 7 days away from the current start_date. The new value is the value of the previous row increments by 1.

| num_ID | start_date  | end_date   | other_column      | diff_weeks  |
| ------ | ----------- | ---------- | ----------------- | ----------- |
| 1      | 2022-02-14  | 2022-02-15 | 09:23:00          | 0           |
| 1      | 2022-02-20  | 2022-02-25 | 12:10:01          | 0           |
| 2      | 2022-03-11  | 2022-03-21 | 08:21:00          | 0           |
| 2      | 2022-03-22  | 2022-03-27 | 02:36:00          | 0           |
| 2      | 2022-04-10  | 2022-04-15 | 11:43:03          | 1           |
| 3      | 2022-02-04  | 2022-02-06 | 16:51:00          | 0           |
| 3      | 2022-02-14  | 2022-02-23 | 19:35:10          | 1           |
| 3      | 2022-03-05  | 2022-03-06 | 21:05:13          | 2           |
| 4      | 2022-02-28  | 2022-10-12 | 00:01:00          | 0           |

@mozway proposed this code, but it does not do the cumulative sum and gives me the following error: TypeError: unsupported operand type(s) for -: 'datetime.date' and 'float'

df['diff_weeks'] = (df['start_date'].sub(df.groupby('num_ID')['end_date'].shift())
                     .ge('7d').astype(int)
                    )

CodePudding user response：

This question has two problems:

The TypeError

TypeError: unsupported operand type(s) for -: 'datetime.date' and 'float'

The cumulative sum missing.

Problem 1

It seems that the columns start_date and end_date are not a pandas datetime object. In order to do that conversion, use pandas.to_datetime as follows

df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

With that, one can run mozway's solution

df['diff_weeks'] = (df['start_date'].sub(df.groupby('num_ID')['end_date'].shift())
                     .ge('7d').astype(int)
                    )

[Out]:
   num_ID start_date   end_date other_column  diff_weeks
0       1 2022-02-14 2022-02-15     09:23:00           0
1       1 2022-02-20 2022-02-25     12:10:01           0
2       2 2022-03-11 2022-03-21     08:21:00           0
3       2 2022-03-22 2022-03-27     02:36:00           0
4       2 2022-04-10 2022-04-15     11:43:03           1
5       3 2022-02-04 2022-02-06     16:51:00           0
6       3 2022-02-14 2022-02-23     19:35:10           1
7       3 2022-03-05 2022-03-06     21:05:13           1
8       4 2022-02-28 2022-10-12     00:01:00           0

Just tried here and it worked.

Problem 2

One can group by num_id and, using the column created before diff_weeks, use .cumsum() as follows

df['diff_weeks'] = df.groupby('num_ID')['diff_weeks'].cumsum()

[Out]:

   num_ID start_date   end_date other_column  diff_weeks
0       1 2022-02-14 2022-02-15     09:23:00           0
1       1 2022-02-20 2022-02-25     12:10:01           0
2       2 2022-03-11 2022-03-21     08:21:00           0
3       2 2022-03-22 2022-03-27     02:36:00           0
4       2 2022-04-10 2022-04-15     11:43:03           1
5       3 2022-02-04 2022-02-06     16:51:00           0
6       3 2022-02-14 2022-02-23     19:35:10           1
7       3 2022-03-05 2022-03-06     21:05:13           2
8       4 2022-02-28 2022-10-12     00:01:00           0

Notes:

For the future, would suggest reading this: One post with multiple questions or multiple posts?