Cumsum by day of two dataframes considering repeated hours-CodePudding

I have the following two dataframes:

print(df_diff)
print(df_census_occupation)

                     pacients
2019-01-01 00:10:00         1
2019-01-01 00:20:00         1
2019-01-01 00:30:00        -1
2019-01-02 10:00:00         1
2019-01-02 11:30:00         1
2019-01-03 00:00:00        -1
2019-01-03 15:00:00        -1
2019-01-03 23:30:00        -1
2019-01-04 00:00:00         1
2019-01-04 00:00:00         1
2019-01-04 10:00:00        -1
2019-01-04 10:00:00        -1

            pacients_census
2019-01-01               10
2019-01-02               20
2019-01-03               30
2019-01-04               10

And I need to transform them into:

                     pacients
2019-01-01 00:00:00        10
2019-01-01 00:10:00        11
2019-01-01 00:20:00        12
2019-01-01 00:30:00        11
2019-01-02 00:00:00        20
2019-01-02 10:00:00        21
2019-01-02 11:30:00        22
2019-01-03 00:00:00        30
2019-01-03 00:00:00        29
2019-01-03 15:00:00        28
2019-01-03 23:30:00        27
2019-01-04 00:00:00        10
2019-01-04 00:00:00        11
2019-01-04 00:00:00        12
2019-01-04 10:00:00        11
2019-01-04 10:00:00        10

It's like a cumsum by day, where each day starts over again based on another dataframe (df_census_occupation). Attention must be taken to consider repeated hours, there may be days where we have exactly the same hour in df_diff, and such hours may also coincide with the start of the day in df_census_occupation. This is what happens in 2019-01-04 00:00:00 for example.

I tried using cumsum with masks and shifts, and also some groupby operations, but the code was becoming difficult to understand and it was not considering the repeated hours issue.

Auxiliary code to generate the two dataframes:

import datetime

df_diff_index = [
    "2019-01-01 00:10:00",
    "2019-01-01 00:20:00",
    "2019-01-01 00:30:00",
    "2019-01-02 10:00:00",
    "2019-01-02 11:30:00",
    "2019-01-03 00:00:00",
    "2019-01-03 15:00:00",
    "2019-01-03 23:30:00",
    "2019-01-04 00:00:00",
    "2019-01-04 00:00:00",
    "2019-01-04 10:00:00",
    "2019-01-04 10:00:00",
]
df_diff_index = [datetime.datetime.strptime(date, "%Y-%m-%d %H:%M:%S") for date in df_diff_index]

df_census_occupation_index = [
    "2019-01-01",
    "2019-01-02",
    "2019-01-03",
    "2019-01-04",
]
df_census_occupation_index = [datetime.datetime.strptime(date, "%Y-%m-%d") for date in df_census_occupation_index]

df_diff = pd.DataFrame({"pacients": [1, 1, -1, 1, 1, -1, -1, -1, 1, 1, -1, -1]}, index=df_diff_index)
df_census_occupation = pd.DataFrame({"pacients_census": [10, 20, 30, 10]}, index=df_census_occupation_index)

CodePudding user response：

Concatenate to data, sort by index, then groupby on the day and cumsum:

out = pd.concat([df_census_occupation.rename(columns={'pacients_census':'pacients'}), df_diff]
               ).sort_index().groupby(pd.Grouper(freq='D')).cumsum()

Output:

                     pacients
2019-01-01 00:00:00        10
2019-01-01 00:10:00        11
2019-01-01 00:20:00        12
2019-01-01 00:30:00        11
2019-01-02 00:00:00        20
2019-01-02 10:00:00        21
2019-01-02 11:30:00        22
2019-01-03 00:00:00        30
2019-01-03 00:00:00        29
2019-01-03 15:00:00        28
2019-01-03 23:30:00        27
2019-01-04 00:00:00        10
2019-01-04 00:00:00        11
2019-01-04 00:00:00        12
2019-01-04 10:00:00        11
2019-01-04 10:00:00        10

note you may want to pass kind='mergesort' to sort_index so the sort is stable, i.e. concensus goes before the data.