I have the following dataframe:
df dataframe:
item date_buy date_sell profit window
1 shoes 2009-12-04 2021-08-14 0.22 10
2 shoes 2009-12-05 2010-09-19 1.5 10
3 shoes 2015-05-05 2020-15-15 7.3 10
4 shoes 2009-12-09 2021-08-14 0.82 4
5 shoes 2009-12-10 2010-09-20 4.5 4
6 shoes 2015-05-11 2020-15-16 1.8 4
7 hat 2009-12-04 2021-08-14 1.2 10
8 hat 2009-12-05 2010-09-19 2.25 10
9 hat 2015-05-05 2020-15-15 4.3 10
10 hat 2009-12-09 2021-08-14 3.2 4
11 hat 2009-12-10 2010-09-20 9.4 4
12 hat 2015-05-11 2020-15-16 1.8 4
What I need to do is to resample the data until today using data_buy
as a key and separating the data by item
and window
. What I did is grouping my data by item
and window
, for each group I add the an extra row exactly as the last of the group changing only data_buy
field with today date and then resample but the execution is extremely slow since I have several thousands of data.
this is my code:
data = data.set_index(pd.to_datetime(data ['date_buy']))
resampled_data = data.groupby(['item', 'window']).apply(lambda x: resample(x, now())
def resample(df, today):
df = pd.concat([df, df[df.index==df.index.max()].rename(index={df.index.max(): pd.to_datetime(today)})])
df = df.asfreq('B', method='ffill')
return df
the result is correct and is the following (it's similar for the item hat):
df dataframe:
item date_buy date_sell profit window
1 shoes 2009-12-04 2021-08-14 0.22 10
2 shoes 2009-12-05 2010-09-19 1.5 10
.
2 shoes 2015-05-04 2010-09-19 1.5 10
3 shoes 2015-05-05 2020-15-15 7.3 10
.
.
3 shoes 2022-09-15 2020-15-15 7.3 10
4 shoes 2009-12-09 2021-08-14 0.82 4
5 shoes 2009-12-10 2010-09-20 4.5 4
.
5 shoes 2015-05-10 2010-09-20 4.5 4
6 shoes 2015-05-11 2020-15-16 1.8 4
.
.
6 shoes 2022-09-15 2020-15-16 1.8 4
This snippet takes about 30s to execute and I wanted to make it faster. Am I missing some pandas best practice to make it faster?
CodePudding user response:
This will work I think, although it will still not be super performant. Adding today's date each time seems to me like something that can't be vectorised (I may be wrong though). But using concat
is expensive, replacing it with reindex
is definitely going to gain you some performance.
data = data.set_index(pd.to_datetime(data['date_buy']))
def my_resample(grp):
return (grp
.reindex(grp.index.tolist() [pd.Timestamp.today().floor('D')])
.fillna(method='ffill')
.asfreq('B', method='ffill'))
(data
.groupby(['item', 'window'])
.apply(my_resample)
.reset_index(level=2)
.reset_index(drop=True)
)
CodePudding user response:
I might have a solution without .apply
:
First step - Create a dataframe end_data
that contains the closing date_buy
entries for each item
-window
group:
today = pd.Timestamp.today().floor('D')
end_data = (
data
.groupby(['item', 'window'], as_index=False)
.agg({'date_buy': lambda c: today})
)
For your sample that looks like:
item window date_buy
0 hat 4 2022-09-15
1 hat 10 2022-09-15
2 shoes 4 2022-09-15
3 shoes 10 2022-09-15
Second step:
data['date_buy'] = pd.to_datetime(data['date_buy']) # Just in case
data = (
pd.concat([data, end_data])
.set_index('date_buy', drop=True).sort_index()
.groupby(['item', 'window'], as_index=False).resample('B').ffill()
.fillna(method='ffill')
.droplevel(0).reset_index()
)
- Cast column
date_buy
intodatetime
s (might already be the case). - Append
end_data
at the end ofdata
. - Use column
date_buy
as index (drop the column), and then sort the index. The sorting is only necessary if theitem
-window
blocks ofdate_buy
aren't already in ascending order. - Now group the result by
item
-window
,.resample('B')
on the groups to upsample according to your requirement, and use.ffill
on the results. - Then fill the remaining
NaN
/NaT
s via forward fill. - Finally drop the first index level, and reset the upsampled
date_buy
-index as a column.
The result for your sample looks like:
date_buy item date_sell profit window
0 2009-12-09 hat 2021-08-14 3.2 4
1 2009-12-10 hat 2010-09-20 9.4 4
2 2009-12-11 hat 2010-09-20 9.4 4
3 2009-12-14 hat 2010-09-20 9.4 4
4 2009-12-15 hat 2010-09-20 9.4 4
... ... ... ... ... ...
13329 2022-09-09 shoes 2020-15-15 7.3 10
13330 2022-09-12 shoes 2020-15-15 7.3 10
13331 2022-09-13 shoes 2020-15-15 7.3 10
13332 2022-09-14 shoes 2020-15-15 7.3 10
13333 2022-09-15 shoes 2020-15-15 7.3 10
[13334 rows x 5 columns]
(Column date_sell
contains invalid dates.)