I have a dataframe with id, purchase date, price of purchase and duration in days:
df
id purchased_date price duration
1 2020-01-01 16.50 2
2 2020-01-01 24.00 4
Wherever the duration is greater than 1 day, I want to split the row into one row per day: the price should be divided by the number of days, and the date should increase by 1 day for each day purchased. Effectively giving me this:
df_new
id purchased_date price duration
1 2020-01-01 8.25 1
1 2020-01-02 8.25 1
2 2020-01-01 6.00 1
2 2020-01-02 6.00 1
2 2020-01-03 6.00 1
2 2020-01-04 6.00 1
So far I've managed to duplicate the rows based on the duration using:
df['price'] = df['price']/df['duration']
df = df.loc[df.index.repeat(df.duration)]
and then I've tried using:
df.groupby(['id', 'purchased_date']).purchased_date.apply(lambda n: n + pd.to_timedelta(1, unit='d'))
However, this just gets stuck in an endless loop, and I'm not sure how to proceed.
My plan is to put this all in a function but for now I just want to get the process working.
Thank you for any help.
CodePudding user response:
Use GroupBy.cumcount as a counter over the repeated rows, pass it to to_timedelta to get day offsets, and add those to the purchased_date column (which must already be a datetime column):
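df['price'] = df['price']/df['duration']
# repeat each row `duration` times, then set duration to 1
df = df.loc[df.index.repeat(df.duration)].assign(duration=1)
# copies of a row share its index label, so cumcount gives 0, 1, 2, ... per original row
df['purchased_date'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df = df.reset_index(drop=True)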
print (df)
id purchased_date price duration
0 1 2020-01-01 8.25 1
1 1 2020-01-02 8.25 1
2 2 2020-01-01 6.00 1
3 2 2020-01-02 6.00 1
4 2 2020-01-03 6.00 1
5 2 2020-01-04 6.00 1
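Since the question mentions wrapping this in a function, here is a minimal self-contained sketch of the same repeat + cumcount idea. The function name split_by_day is made up for illustration, and it assumes purchased_date can be parsed by pd.to_datetime and that duration holds positive integers.

```python
import pandas as pd


def split_by_day(df):
    """Expand each row into one row per day of its duration (hypothetical helper)."""
    out = df.copy()
    out["purchased_date"] = pd.to_datetime(out["purchased_date"])
    out["price"] = out["price"] / out["duration"]
    # Repeat each row `duration` times; the copies keep the original index label.
    out = out.loc[out.index.repeat(out["duration"])]
    # Number the copies of each original row 0, 1, 2, ... and use that as a day offset.
    offset = pd.to_timedelta(out.groupby(level=0).cumcount(), unit="D")
    out["purchased_date"] = out["purchased_date"] + offset
    out["duration"] = 1
    return out.reset_index(drop=True)


df = pd.DataFrame({
    "id": [1, 2],
    "purchased_date": ["2020-01-01", "2020-01-01"],
    "price": [16.50, 24.00],
    "duration": [2, 4],
})
result = split_by_day(df)
print(result)
```

Working on a copy leaves the caller's dataframe untouched, which tends to be easier to reason about once this lives inside a function.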
CodePudding user response:
An approach with pandas.date_range and explode:
(df.assign(price=df['price'].div(df['duration']),
           purchased_date=df.apply(lambda x: pd.date_range(x['purchased_date'],
                                                           periods=x['duration']),
                                   axis=1),
           duration=1
           )
   .explode('purchased_date', ignore_index=True)
)
output:
id purchased_date price duration
0 1 2020-01-01 8.25 1
1 1 2020-01-02 8.25 1
2 2 2020-01-01 6.00 1
3 2 2020-01-02 6.00 1
4 2 2020-01-03 6.00 1
5 2 2020-01-04 6.00 1
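For anyone who wants to try the date_range + explode idea end to end, here is a hedged, self-contained sketch. It is a variation rather than the exact code above: it builds the per-row date lists with a list comprehension instead of DataFrame.apply, and the names date_lists and expanded are made up for illustration. It assumes purchased_date is already datetime and that ignore_index is available in explode (pandas 1.1 or later).

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "purchased_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "price": [16.50, 24.00],
    "duration": [2, 4],
})

# One list of consecutive days per row: `duration` days starting at purchased_date.
date_lists = pd.Series(
    [list(pd.date_range(start, periods=int(n)))
     for start, n in zip(df["purchased_date"], df["duration"])],
    index=df.index,
)

expanded = (
    df.assign(price=df["price"] / df["duration"],
              purchased_date=date_lists,
              duration=1)
      .explode("purchased_date", ignore_index=True)
)
# explode leaves an object column, so convert it back to datetime64.
expanded["purchased_date"] = pd.to_datetime(expanded["purchased_date"])
print(expanded)
```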
CodePudding user response:
Here is an easy to understand approach:
- Assign average 'price' value
- Create a temporary 'end_date' column
- Modify 'purchased_date' to form a list of date-time
- Explode 'purchased_date' to form new rows
- Delete the temporary 'end_date' column
Code:
df['price'] = df['price']/df['duration']
df['end_date'] = df.purchased_date + pd.to_timedelta(df.duration.sub(1), unit='d')
df['purchased_date'] = df.apply(lambda x: pd.date_range(start=x['purchased_date'], end=x['end_date']), axis=1)
df = df.explode('purchased_date').reset_index(drop=True)
del df['end_date']
print (df)
id purchased_date price duration
0 1 2020-01-01 8.25 2
1 1 2020-01-02 8.25 2
2 2 2020-01-01 6.00 4
3 2 2020-01-02 6.00 4
4 2 2020-01-03 6.00 4
5 2 2020-01-04 6.00 4
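Note that this output still carries the original duration values (2 and 4), whereas the expected result in the question has duration 1 on every row. Here is a minimal sketch of the same end_date idea with that column reset, plus a quick check that the daily prices still add up to the original price per id. The variable names original and totals are made up for illustration, the date lists are built with a list comprehension rather than apply, and purchased_date is assumed to be datetime already.

```python
import pandas as pd

original = pd.DataFrame({
    "id": [1, 2],
    "purchased_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "price": [16.50, 24.00],
    "duration": [2, 4],
})

df = original.copy()
df["price"] = df["price"] / df["duration"]
# Last day covered by each purchase (temporary column).
df["end_date"] = df["purchased_date"] + pd.to_timedelta(df["duration"] - 1, unit="D")
# Replace each start date with the list of all days from start to end, inclusive.
df["purchased_date"] = pd.Series(
    [list(pd.date_range(start=s, end=e))
     for s, e in zip(df["purchased_date"], df["end_date"])],
    index=df.index,
)
df = df.explode("purchased_date").reset_index(drop=True)
df["purchased_date"] = pd.to_datetime(df["purchased_date"])
# Drop the helper column and mark every row as a single day.
df = df.drop(columns="end_date").assign(duration=1)

# Sanity check: per-day prices should sum back to the original price per id.
totals = df.groupby("id")["price"].sum()
print(df)
print(totals)
```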