Be the following python pandas DataFrame:
| num_ID | start_date | end_date | other_column |
| ------ | ----------- | ---------- | ----------------- |
| 1 | 2022-02-14 | 2022-02-15 | 09:23:00 |
| 1 | 2022-02-20 | 2022-02-25 | 12:10:01 |
| 2 | 2022-03-11 | 2022-03-21 | 08:21:00 |
| 2 | 2022-03-22 | 2022-03-27 | 02:36:00 |
| 2 | 2022-04-10 | 2022-04-15 | 11:43:03 |
| 3 | 2022-02-04 | 2022-02-06 | 16:51:00 |
| 3 | 2022-02-14 | 2022-02-23 | 19:35:10 |
| 3 | 2022-03-05 | 2022-03-06 | 21:05:13 |
| 4 | 2022-02-28 | 2022-10-12 | 00:01:00 |
For each num_ID
value, I want to add a new column diff_weeks
with a cumulative value if the end_date
value of the previous row and the start_date
value from the actual one are at least 7 days apart.
First add the value 0 for this column for the first row of each num_ID
value, since they do not have a previous row.
| num_ID | start_date | end_date | other_column | diff_weeks |
| ------ | ----------- | ---------- | ----------------- | ----------- |
| 1 | 2022-02-14 | 2022-02-15 | 09:23:00 | 0 |
| 1 | 2022-02-20 | 2022-02-25 | 12:10:01 | |
| 2 | 2022-03-11 | 2022-03-21 | 08:21:00 | 0 |
| 2 | 2022-03-22 | 2022-03-27 | 02:36:00 | |
| 2 | 2022-04-10 | 2022-04-15 | 11:43:03 | |
| 3 | 2022-02-04 | 2022-02-06 | 16:51:00 | 0 |
| 3 | 2022-02-14 | 2022-02-23 | 19:35:10 | |
| 3 | 2022-03-05 | 2022-03-06 | 21:05:13 | |
| 4 | 2022-02-28 | 2022-10-12 | 00:01:00 | 0 |
Finally we fill in the remaining rows with the condition, if the end_date
value of the previous row (with the same num_ID
) is at least 7 days away from the current start_date
. The new value is the value of the previous row increments by 1.
| num_ID | start_date | end_date | other_column | diff_weeks |
| ------ | ----------- | ---------- | ----------------- | ----------- |
| 1 | 2022-02-14 | 2022-02-15 | 09:23:00 | 0 |
| 1 | 2022-02-20 | 2022-02-25 | 12:10:01 | 0 |
| 2 | 2022-03-11 | 2022-03-21 | 08:21:00 | 0 |
| 2 | 2022-03-22 | 2022-03-27 | 02:36:00 | 0 |
| 2 | 2022-04-10 | 2022-04-15 | 11:43:03 | 1 |
| 3 | 2022-02-04 | 2022-02-06 | 16:51:00 | 0 |
| 3 | 2022-02-14 | 2022-02-23 | 19:35:10 | 1 |
| 3 | 2022-03-05 | 2022-03-06 | 21:05:13 | 2 |
| 4 | 2022-02-28 | 2022-10-12 | 00:01:00 | 0 |
@mozway proposed this code, but it does not do the cumulative sum and gives me the following error: TypeError: unsupported operand type(s) for -: 'datetime.date' and 'float'
df['diff_weeks'] = (df['start_date'].sub(df.groupby('num_ID')['end_date'].shift())
.ge('7d').astype(int)
)
CodePudding user response:
This question has two problems:
- The TypeError
TypeError: unsupported operand type(s) for -: 'datetime.date' and 'float'
- The cumulative sum missing.
Problem 1
It seems that the columns start_date
and end_date
are not a pandas datetime object. In order to do that conversion, use pandas.to_datetime
as follows
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
With that, one can run mozway's solution
df['diff_weeks'] = (df['start_date'].sub(df.groupby('num_ID')['end_date'].shift())
.ge('7d').astype(int)
)
[Out]:
num_ID start_date end_date other_column diff_weeks
0 1 2022-02-14 2022-02-15 09:23:00 0
1 1 2022-02-20 2022-02-25 12:10:01 0
2 2 2022-03-11 2022-03-21 08:21:00 0
3 2 2022-03-22 2022-03-27 02:36:00 0
4 2 2022-04-10 2022-04-15 11:43:03 1
5 3 2022-02-04 2022-02-06 16:51:00 0
6 3 2022-02-14 2022-02-23 19:35:10 1
7 3 2022-03-05 2022-03-06 21:05:13 1
8 4 2022-02-28 2022-10-12 00:01:00 0
Just tried here and it worked.
Problem 2
One can group by num_id
and, using the column created before diff_weeks
, use .cumsum()
as follows
df['diff_weeks'] = df.groupby('num_ID')['diff_weeks'].cumsum()
[Out]:
num_ID start_date end_date other_column diff_weeks
0 1 2022-02-14 2022-02-15 09:23:00 0
1 1 2022-02-20 2022-02-25 12:10:01 0
2 2 2022-03-11 2022-03-21 08:21:00 0
3 2 2022-03-22 2022-03-27 02:36:00 0
4 2 2022-04-10 2022-04-15 11:43:03 1
5 3 2022-02-04 2022-02-06 16:51:00 0
6 3 2022-02-14 2022-02-23 19:35:10 1
7 3 2022-03-05 2022-03-06 21:05:13 2
8 4 2022-02-28 2022-10-12 00:01:00 0
Notes:
- For the future, would suggest reading this: One post with multiple questions or multiple posts?