Get the amount of open tasks for any specific date with Python/Pandas-CodePudding

I have a list of tasks with start and end date and i need to know how many open tasks i had on a specific day.

My current approach for a solution is a cross join with a calendar table:

import pandas as pd

df = pd.DataFrame({
    'ID': [101, 102, 103],
    'Start' : ['01.01.2022', '02.01.2022', '03.01.2022'],
    'End' : ['03.01.2022', '08.01.2022', '05.01.2022']
})

df['Start'] = pd.to_datetime(df['Start'], format="%d.%m.%Y")
df['Ende'] = pd.to_datetime(df['End'], format="%d.%m.%Y")

calender = pd.DataFrame({
    'Day': pd.date_range(start=df['Start'].min(), end=df['End'].max())
})

df = pd.merge(left=df, right=calender, how='cross')
df = df.loc[(df['Day'] >= df['Start']) & (df['Day'] <= df['End'])]

df.pivot_table(index='Day', aggfunc='count', values='ID')

This solution works well on a small set of data, but my original data has 500k items and calendar helptable with 2k items. So the cross join leads to 1 billion rows, what seems to be very inefficent and slows down the system.

Is there a better way to solve this with pandas?

CodePudding user response：

You can use .apply that calls function where you increase range of DateTimeIndex by 1. For example:

df = pd.DataFrame(
    {
        "ID": [101, 102, 103],
        "Start": ["01.01.2022", "02.01.2022", "03.01.2022"],
        "End": ["03.01.2022", "08.01.2022", "05.01.2022"],
    }
)

df["Start"] = pd.to_datetime(df["Start"], format="%d.%m.%Y")
df["End"] = pd.to_datetime(df["End"], format="%d.%m.%Y")


out = pd.DataFrame(
    {"Num Tasks": 0},
    index=pd.date_range(start=df["Start"].min(), end=df["End"].max()),
)


def increase_tasks(row):
    out.loc[row["Start"] : row["End"]]  = 1


df.apply(increase_tasks, axis=1)
print(out)

Prints:

            Num Tasks
2022-01-01          1
2022-01-02          2
2022-01-03          3
2022-01-04          2
2022-01-05          2
2022-01-06          1
2022-01-07          1
2022-01-08          1