Modify duplicate rows with datetime

Time:05-11

I have a dataframe with id, purchase date, price of purchase and duration in days,

df
id       purchased_date      price       duration
1            2020-01-01      16.50          2
2            2020-01-01      24.00          4
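
To reproduce the example, the sample frame could be built roughly as follows. This is a minimal sketch that assumes purchased_date is stored as datetime64; the question does not state the dtype, and the date arithmetic used later depends on it.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: purchased_date is parsed to datetime64 rather than kept as strings.
df = pd.DataFrame({
    "id": [1, 2],
    "purchased_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "price": [16.50, 24.00],
    "duration": [2, 4],
})

print(df.dtypes)
```

If the column holds strings instead, adding a timedelta to it will fail, so converting with pd.to_datetime first is the safer starting point.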

What I'm trying to do is this: wherever the duration is greater than 1 day, I want the row split into one row per day, with the price divided by the number of days and the date increased by 1 day for each successive row. Effectively giving me this:

df_new
id        purchased_date    price      duration
1           2020-01-01       8.25         1
1           2020-01-02       8.25         1
2           2020-01-01       6.00         1
2           2020-01-02       6.00         1
2           2020-01-03       6.00         1
2           2020-01-04       6.00         1

So far I've managed to duplicate the rows based on the duration using:

df['price'] = df['price']/df['duration']

df = df.loc[df.index.repeat(df.duration)]

and then I've tried using,

df.groupby(['id', 'purchased_date']).purchased_date.apply(lambda n: n + pd.to_timedelta(1, unit='d'))

However, this just seems to hang in an endless loop, and I'm not sure how to proceed.

My plan is to put this all in a function but for now I just want to get the process working.

Thank you for any help.

CodePudding user response:

Use GroupBy.cumcount as a counter, pass it to to_timedelta to get day timedeltas, and add the result to the purchased_date column:

df['price'] = df['price']/df['duration']

df = df.loc[df.index.repeat(df.duration)].assign(duration=1)

df['purchased_date'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df = df.reset_index(drop=True)
print (df)
   id purchased_date  price  duration
0   1     2020-01-01   8.25         1
1   1     2020-01-02   8.25         1
2   2     2020-01-01   6.00         1
3   2     2020-01-02   6.00         1
4   2     2020-01-03   6.00         1
5   2     2020-01-04   6.00         1
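
Since the question mentions wrapping the process in a function, the cumcount approach above might be packaged along these lines. This is only a sketch: split_by_day and its parameter names are hypothetical, and it assumes the date column is already datetime64 and that duration holds positive integers.

```python
import pandas as pd

def split_by_day(df, date_col="purchased_date", price_col="price", dur_col="duration"):
    """Hypothetical helper: return one row per purchased day, price split evenly."""
    # Work on a fresh frame with a unique 0..n-1 index so that
    # groupby(level=0) below groups the copies of each original row.
    out = df.reset_index(drop=True)
    out[price_col] = out[price_col] / out[dur_col]
    # Repeat each row `duration` times, then mark each copy as a single day.
    out = out.loc[out.index.repeat(out[dur_col])].assign(**{dur_col: 1})
    # cumcount over the repeated index labels gives 0, 1, 2, ... per original row.
    offsets = pd.to_timedelta(out.groupby(level=0).cumcount(), unit="D")
    out[date_col] = out[date_col] + offsets
    return out.reset_index(drop=True)

# Usage with the sample data from the question.
sample = pd.DataFrame({
    "id": [1, 2],
    "purchased_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "price": [16.50, 24.00],
    "duration": [2, 4],
})
result = split_by_day(sample)
print(result)
```

Resetting the index at the start is a design choice: the cumcount counter relies on each original row having its own index label, which an arbitrary input frame does not guarantee.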

CodePudding user response:

An approach with pandas.date_range and explode:

(df.assign(price=df['price'].div(df['duration']),
           purchased_date=df.apply(lambda x: pd.date_range(x['purchased_date'],
                                                           periods=x['duration']),
                                   axis=1),
           duration=1
          )
   .explode('purchased_date', ignore_index=True)
)

output:

   id purchased_date  price  duration
0   1     2020-01-01   8.25         1
1   1     2020-01-02   8.25         1
2   2     2020-01-01   6.00         1
3   2     2020-01-02   6.00         1
4   2     2020-01-03   6.00         1
5   2     2020-01-04   6.00         1

CodePudding user response:

Here is an easy-to-understand approach:

  1. Assign average 'price' value
  2. Create a temporary 'end_date' column
  3. Replace 'purchased_date' with a range of dates from the purchase date to 'end_date'
  4. Explode 'purchased_date' to form new rows
  5. Delete the temporary 'end_date' column

Code:

df['price'] = df['price']/df['duration']
df['end_date'] = df.purchased_date + pd.to_timedelta(df.duration.sub(1), unit='d')
df['purchased_date'] = df.apply(lambda x: pd.date_range(start=x['purchased_date'], end=x['end_date']), axis=1)

df = df.explode('purchased_date').reset_index(drop=True)
del df['end_date']

print (df)

output:

   id purchased_date  price  duration
0   1     2020-01-01   8.25         2
1   1     2020-01-02   8.25         2
2   2     2020-01-01   6.00         4
3   2     2020-01-02   6.00         4
4   2     2020-01-03   6.00         4
5   2     2020-01-04   6.00         4
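
Note that the output of this last approach keeps the original duration values (2 and 4), whereas the desired frame in the question has duration 1 on every row. A possible adjustment is sketched below; it assumes the same sample data, uses periods= in date_range instead of a temporary end_date column, and converts the exploded column back with pd.to_datetime because explode may leave it with object dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "purchased_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "price": [16.50, 24.00],
    "duration": [2, 4],
})

df["price"] = df["price"] / df["duration"]
# One date range per row, `duration` days long, starting at the purchase date.
df["purchased_date"] = df.apply(
    lambda x: pd.date_range(start=x["purchased_date"], periods=x["duration"]),
    axis=1,
)
df = df.explode("purchased_date", ignore_index=True)
# Make sure the exploded dates are datetime64 again.
df["purchased_date"] = pd.to_datetime(df["purchased_date"])
# Each row now represents a single day.
df["duration"] = 1

print(df)
```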