I have the following DataFrame, with over 3 million rows:
   VALID_FROM    VALID_TO  VALUE
0  2022-01-01  2022-01-02      5
1  2022-01-01  2022-01-03      2
2  2022-01-02  2022-01-04      7
3  2022-01-03  2022-01-06      3
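For reference, here is a small snippet that rebuilds the example frame (the date columns are assumed to be datetimes):

import pandas as pd

# sample frame from above; dtypes are an assumption
dummy_df = pd.DataFrame(
    {
        "VALID_FROM": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"]),
        "VALID_TO": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"]),
        "VALUE": [5, 2, 7, 3],
    }
)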
I want to create one large date_range with a sum of the values for each timestamp. For the DataFrame above that would come out to:
        dates  val
0  2022-01-01    7
1  2022-01-02   14
2  2022-01-03   12
3  2022-01-04   10
4  2022-01-05    3
5  2022-01-06    3
However, as the DataFrame has a little over 3 million rows, I don't want to iterate over each row, and I'm not sure how to do this without iterating. Any suggestions?
Currently my code looks like this:
new_df = pd.DataFrame()
for idx, row in dummy_df.iterrows():
    dr = pd.date_range(row["VALID_FROM"], end=row["VALID_TO"], freq="D")
    tmp_df = pd.DataFrame({"dates": dr, "val": row["VALUE"]})
    new_df = pd.concat(objs=[new_df, tmp_df], ignore_index=True)
new_df.groupby("dates", as_index=False, group_keys=False).sum()
The result of the groupby would be my desired output.
CodePudding user response:
If performance is important, use Index.repeat with DataFrame.loc to create the new rows, build a dates column with a per-row counter from GroupBy.cumcount, and finally aggregate with sum:
df['VALID_FROM'] = pd.to_datetime(df['VALID_FROM'])
df['VALID_TO'] = pd.to_datetime(df['VALID_TO'])

# repeat each row once per day in its inclusive [VALID_FROM, VALID_TO] range
df1 = df.loc[df.index.repeat(df['VALID_TO'].sub(df['VALID_FROM']).dt.days + 1)]

# offset VALID_FROM by a per-row counter to get the individual dates
df1['dates'] = df1['VALID_FROM'] + pd.to_timedelta(df1.groupby(level=0).cumcount(), unit='d')

df1 = df1.groupby('dates', as_index=False)['VALUE'].sum()
print(df1)
        dates  VALUE
0  2022-01-01      7
1  2022-01-02     14
2  2022-01-03     12
3  2022-01-04     10
4  2022-01-05      3
5  2022-01-06      3
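If you want the exact column names from the question ("dates"/"val"), a small optional follow-up is to rename the value column after aggregating:

# optional: match the question's desired column names
df1 = df1.rename(columns={'VALUE': 'val'})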
CodePudding user response:
One option is to build a Series of dates, running from the min to the max of the original DataFrame, use a non-equi join via conditional_join (from the pyjanitor library) to match each date to the intervals that contain it, and finally group by and sum:
# pip install pyjanitor
import pandas as pd
import janitor

# build the date range spanning the whole frame:
dates = df.filter(like='VALID').to_numpy()
dates = pd.date_range(dates.min(), dates.max(), freq='1D')
dates = pd.Series(dates, name='dates')

# compute the inequality join between VALID_FROM and VALID_TO,
# followed by the aggregation on a groupby:
(df
 .conditional_join(
     dates,
     ('VALID_FROM', 'dates', '<='),
     ('VALID_TO', 'dates', '>='),
     # if you have numba installed,
     # it can improve performance
     use_numba=False,
     df_columns='VALUE')
 .groupby('dates')
 .VALUE
 .sum()
)
dates
2022-01-01     7
2022-01-02    14
2022-01-03    12
2022-01-04    10
2022-01-05     3
2022-01-06     3
Name: VALUE, dtype: int64
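The chain above returns a Series; if you prefer a flat DataFrame with the column names from the question, one small optional step is to append Series.reset_index to the chain:

out = (df
       .conditional_join(
           dates,
           ('VALID_FROM', 'dates', '<='),
           ('VALID_TO', 'dates', '>='),
           use_numba=False,
           df_columns='VALUE')
       .groupby('dates')
       .VALUE
       .sum()
       # optional: turn the index into a 'dates' column
       # and name the values column 'val'
       .reset_index(name='val'))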