Home > Back-end >  Upsampling and dividing data in pandas
Upsampling and dividing data in pandas

Time:10-18

I am trying to upsample a pandas datetime-indexed dataframe, so that resulting data is equally divided over the new entries.

For instance, let's say I have a dataframe which stores a cost each month, and I want to get a dataframe which summarizes the equivalent costs per day for each month:

df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31], 
                    [pd.to_datetime('2023-02-01'), 14]], 
                    columns=['time', 'cost']
                    )
      .set_index("time")
     )

Daily costs are 1$ (or whatever currency you like) in January, and 0.5$ in February. My goal in picture:

enter image description here

After a lot of struggle, I managed to obtain the next code snippet which seems to do what I want:

# add a value to perform a correct resampling
df.loc[df.index.max()   relativedelta(months=1)] = 0

# forward-fill over the right scale
# then divide each entry per the number of rows in the month
df = (df
      .resample('1d')
      .ffill()
      .iloc[:-1]
      .groupby(lambda x: datetime(x.year, x.month, 1))
      .transform(lambda x: (x / x.count()))
      )

However, this is not entirely ok:

  • using transform forces me to have dataframes with a single column ;
  • I need to hardcode my original frequency several times in different formats (while adding an extra value at the end of the dataframe, and in the groupby), making a function design hard ;
  • It only works with evenly-spaced datetime index (even if it's ok in my case) ;
  • it remains complex.

Does anyone have a suggestion to improve that code snippet ?

CodePudding user response:

What if we took df's month indices and expanded them into days range, while dividing df's values by a number those days and assigning to each day, all by list comprehensions (edit: for equally distributed values per day):

import pandas as pd

# initial DataFrame
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31], 
                    [pd.to_datetime('2023-02-01'), 14]], 
                   columns=['time', 'cost']
       ).set_index("time"))

# reformat to months
df.index = df.index.strftime('%m-%Y')

df1 = pd.concat(  # concatenate the resulted DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df 
        [v / pd.Period(i).days_in_month  # each month's value divided by n of days in a month
         for d in range(pd.Period(i).days_in_month)],  # repeated for as many times as there are days
        index=pd.date_range(start=i, periods=pd.Period(i).days_in_month, freq='D'))  # days range
     for i, v in df.iterrows()])  # for each df's index and value
df1

Output:

           cost
2023-01-01  1.0
2023-01-02  1.0
2023-01-03  1.0
2023-01-04  1.0
2023-01-05  1.0
2023-01-06  1.0
2023-01-07  1.0
2023-01-08  1.0
2023-01-09  1.0
2023-01-10  1.0
2023-01-11  1.0
    ...     ...
2023-02-13  0.5
2023-02-14  0.5
2023-02-15  0.5
2023-02-16  0.5
2023-02-17  0.5
2023-02-18  0.5
2023-02-19  0.5
2023-02-20  0.5
2023-02-21  0.5
2023-02-22  0.5
2023-02-23  0.5
2023-02-24  0.5
2023-02-25  0.5
2023-02-26  0.5
2023-02-27  0.5
2023-02-28  0.5

What could be done to avoid uniform distribution of daily costs and for the cases with multiple columns? Here's an extended df:

# additional columns and a row
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31, 62, 23], 
                    [pd.to_datetime('2023-02-01'), 14, 28, 51],
                    [pd.to_datetime('2023-03-01'), 16, 33, 21]], 
                   columns=['time', 'cost1', 'cost2', 'cost3']
       ).set_index("time"))

# reformat to months
df.index = df.index.strftime('%m-%Y')
df

Output:

          cost1  cost2  cost3
time            
01-2023   31     62     23
02-2023   14     28     51
03-2023   16     33     21

Here's what I came up for the cases where monthly costs may be upsampled by randomized daily costs, inspired by this question. This solution is scalable to the number of columns and rows:

df1 = pd.concat(  # concatenate the resulted DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df 
        # here we make a Series with random Dirichlet distributed numbers
        # with length of a month and a column's value as the sum 
        [pd.Series((np.random.dirichlet(np.ones(pd.Period(i).days_in_month), size=1)*v
                   ).flatten())  # the product is an ndarray that needs flattening 
         for v in row],  # for every column value in a row 
        # index renamed as columns because of the created DataFrame's shape 
        index=df.columns  
                    # transpose and set the proper index
                    ).T.set_index(
                           pd.date_range(start=i,
                                         periods=pd.Period(i).days_in_month,
                                         freq='D'))  
      for i, row in df.iterrows()])  # iterate over every row

Output:

            cost1       cost2       cost3
2023-01-01  1.703177    1.444117    0.160151
2023-01-02  0.920706    3.664460    0.823405
2023-01-03  1.210426    1.194963    0.294093
2023-01-04  0.214737    1.286273    0.923881
2023-01-05  1.264553    0.380062    0.062829
...         ...         ...         ...
2023-03-27  0.124092    0.615885    0.251369
2023-03-28  0.520578    1.505830    1.632373
2023-03-29  0.245154    3.094078    0.308173
2023-03-30  0.530927    0.406665    1.149860
2023-03-31  0.276992    1.115308    0.432090
90 rows × 3 columns

To assert the monthly sum:

df1.groupby(pd.Grouper(freq='M')).agg('sum')

Output:

            cost1   cost2   cost3
2023-01-31  31.0    62.0    23.0
2023-02-28  14.0    28.0    51.0
2023-03-31  16.0    33.0    21.0
  • Related