Combine a dict of varied-length data frames into a single data frame, and pad the shorter ones with zeros

I have a collection of time series stored in a dict:

my_dict = {
    "ts_1": df_1,
    "ts_2": df_2,
    "ts_3": df_3,
    "ts_4": df_4,
    "ts_5": df_5,
}

...where each df is of the form:

       yr_mo    number
0   2014-12-01  2
1   2015-01-01  42
2   2015-02-01  7
3   2015-03-01  15
4   2015-04-01  18

I need to convert this dict of data frames into a single df of the form:

   2014-12-01  2015-01-01  2015-02-01  2015-03-01  2015-04-01
0           2          42           7          15          18
1           7          44          33          38           9
2           3           6          43           9           3
3           8           3          77           3           9
4           1           7          79           2          10

Since these time series have different lengths, I would like to pad the shorter ones with zeros so that every row has the same number of columns.

I have tried by doing the following:

df_collection = pd.DataFrame.from_dict(my_dict, orient='index')

...but this gives the incorrect format.

CodePudding user response：

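pd.concat on the dictionary stacks the frames and uses the dict keys as the outer index level. droplevel(1) then removes each frame's original row index, set_index('yr_mo', append=True) adds yr_mo as a second index level, and unstack(fill_value=0) pivots that level into columns, writing 0 wherever a series has no value for a date. Lastly, rename_axis(columns=None) removes the leftover column-axis name and reset_index(drop=True) replaces the dict keys with a default range index: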

new_df = (
    pd.concat(my_dict)
        .droplevel(1).set_index('yr_mo', append=True)['number']
        .unstack(fill_value=0)
        .rename_axis(columns=None)
        .reset_index(drop=True)
)

new_df:

   2014-12-01  2015-01-01  2015-02-01  2015-03-01  2015-04-01
0           0          29          19          38          39
1          28          41           5          25           0
2           3          21           0          35          13
3           7          26          33          23           0
4          10          47          39           0          17
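
To make the chain easier to follow, here is a minimal sketch that runs the same steps one at a time. The two-series input and the names ts_a, ts_b, stacked, keyed and wide are made up for illustration and are not part of the question's data:

```python
import pandas as pd

# Hypothetical input: "ts_a" has no 2015-02-01 row, "ts_b" has no 2014-12-01 row.
my_dict = {
    "ts_a": pd.DataFrame({"yr_mo": ["2014-12-01", "2015-01-01"], "number": [2, 42]}),
    "ts_b": pd.DataFrame({"yr_mo": ["2015-01-01", "2015-02-01"], "number": [44, 33]}),
}

# Step 1: stack the frames; the index becomes (dict key, original row label).
stacked = pd.concat(my_dict)

# Step 2: drop the original row labels, then append yr_mo as a second index level
# and keep only the 'number' column as a Series.
keyed = stacked.droplevel(1).set_index("yr_mo", append=True)["number"]

# Step 3: pivot yr_mo into columns; dates a series lacks are filled with 0.
wide = keyed.unstack(fill_value=0)

# Step 4: remove the 'yr_mo' column-axis name and replace the dict keys with 0..n-1.
new_df = wide.rename_axis(columns=None).reset_index(drop=True)

print(new_df)
```

Because fill_value=0 is applied inside unstack, no NaN is ever created and the result keeps its integer dtype.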

Reproducible random sample data (sample(frac=0.9) keeps four of the five rows in each frame, so every series is missing one date):

import pandas as pd
from numpy.random import Generator, MT19937

bit_gen = MT19937(15)
rng = Generator(bit_gen)

r = list(range(5))
my_dict = dict(zip(r, [pd.DataFrame({
    'yr_mo': ['2014-12-01', '2015-01-01', '2015-02-01', '2015-03-01',
              '2015-04-01'],
    'number': rng.integers(1, 50, 5)
}).sample(frac=0.9, random_state=bit_gen, ignore_index=True) for _ in r]))

CodePudding user response：

ts_gen = (ts.set_index('yr_mo') for ts in my_dict.values())

res = (
    pd.concat(ts_gen, axis=1,      # align the series on their 'yr_mo' index, one column per series
              ignore_index=True)   # relabel the columns 0..n-1 (otherwise all are named 'number')
      .T                           # transpose so the dates become the columns
      .fillna(0)                   # fill the dates missing from a series with zero
      .rename_axis(None, axis=1)   # remove the name of the column axis (which was 'yr_mo')
)

Example with two series of different lengths:

import pandas as pd
from pandas import Timestamp

df1 = pd.DataFrame({
    'yr_mo': [Timestamp('2014-12-01'),
              Timestamp('2015-01-01'),
              Timestamp('2015-02-01'),
              Timestamp('2015-03-01'),
              Timestamp('2015-04-01')],
    'number': [2, 42, 7, 15, 18],
})

df2 = pd.DataFrame({
    'yr_mo': [Timestamp('2014-12-01'),
              Timestamp('2015-01-01'),
              Timestamp('2015-02-01'),
              Timestamp('2015-03-01'),
              Timestamp('2015-04-01'),
              Timestamp('2015-05-01')],
    'number': [2, 4, 2, 10, 18, 20],
})

my_dict = {"ts_1" : df1, "ts_2" : df2}

>>> df1 

       yr_mo  number
0 2014-12-01       2
1 2015-01-01      42
2 2015-02-01       7
3 2015-03-01      15
4 2015-04-01      18


>>> df2

       yr_mo  number
0 2014-12-01       2
1 2015-01-01       4
2 2015-02-01       2
3 2015-03-01      10
4 2015-04-01      18
5 2015-05-01      20    # the only date that is not present in df1

>>> res

   2014-12-01  2015-01-01  2015-02-01  2015-03-01  2015-04-01  2015-05-01
0         2.0        42.0         7.0        15.0        18.0         0.0
1         2.0         4.0         2.0        10.0        18.0        20.0
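
The values in res are floats because the dates a series lacks are NaN after the alignment, which forces a float dtype before fillna runs. If integer counts are wanted, one option is to cast after filling. The sketch below assumes every value is a whole number and uses a shortened, made-up variant of df1 and df2:

```python
import pandas as pd

# Shortened, hypothetical variants of df1 and df2 from the example.
df1 = pd.DataFrame({
    "yr_mo": pd.to_datetime(["2014-12-01", "2015-01-01"]),
    "number": [2, 42],
})
df2 = pd.DataFrame({
    "yr_mo": pd.to_datetime(["2014-12-01", "2015-01-01", "2015-02-01"]),
    "number": [2, 4, 20],
})
my_dict = {"ts_1": df1, "ts_2": df2}

res = (
    pd.concat([ts.set_index("yr_mo") for ts in my_dict.values()],
              axis=1, ignore_index=True)
      .T
      .fillna(0)                  # NaN -> 0, still float at this point
      .astype(int)                # safe only because no NaN is left
      .rename_axis(None, axis=1)
)

print(res)
```

Calling astype(int) before fillna would raise an error, since NaN cannot be converted to an integer.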