I have a list of tasks with start and end date and i need to know how many open tasks i had on a specific day.
My current approach for a solution is a cross join with a calendar table:
import pandas as pd
df = pd.DataFrame({
'ID': [101, 102, 103],
'Start' : ['01.01.2022', '02.01.2022', '03.01.2022'],
'End' : ['03.01.2022', '08.01.2022', '05.01.2022']
})
df['Start'] = pd.to_datetime(df['Start'], format="%d.%m.%Y")
df['Ende'] = pd.to_datetime(df['End'], format="%d.%m.%Y")
calender = pd.DataFrame({
'Day': pd.date_range(start=df['Start'].min(), end=df['End'].max())
})
df = pd.merge(left=df, right=calender, how='cross')
df = df.loc[(df['Day'] >= df['Start']) & (df['Day'] <= df['End'])]
df.pivot_table(index='Day', aggfunc='count', values='ID')
This solution works well on a small set of data, but my original data has 500k items and calendar helptable with 2k items. So the cross join leads to 1 billion rows, what seems to be very inefficent and slows down the system.
Is there a better way to solve this with pandas?
CodePudding user response:
You can use .apply
that calls function where you increase range of DateTimeIndex
by 1
. For example:
df = pd.DataFrame(
{
"ID": [101, 102, 103],
"Start": ["01.01.2022", "02.01.2022", "03.01.2022"],
"End": ["03.01.2022", "08.01.2022", "05.01.2022"],
}
)
df["Start"] = pd.to_datetime(df["Start"], format="%d.%m.%Y")
df["End"] = pd.to_datetime(df["End"], format="%d.%m.%Y")
out = pd.DataFrame(
{"Num Tasks": 0},
index=pd.date_range(start=df["Start"].min(), end=df["End"].max()),
)
def increase_tasks(row):
out.loc[row["Start"] : row["End"]] = 1
df.apply(increase_tasks, axis=1)
print(out)
Prints:
Num Tasks
2022-01-01 1
2022-01-02 2
2022-01-03 3
2022-01-04 2
2022-01-05 2
2022-01-06 1
2022-01-07 1
2022-01-08 1