I have a collection of time series in a dict:
my_dict = {
    "ts_1": df_1,
    "ts_2": df_2,
    "ts_3": df_3,
    "ts_4": df_4,
    "ts_5": df_5,
}
...where each df is of the form:
       yr_mo  number
0 2014-12-01       2
1 2015-01-01      42
2 2015-02-01       7
3 2015-03-01      15
4 2015-04-01      18
I need to convert this dict of data frames into a single df of the form:
   2014-12-01  2015-01-01  2015-02-01  2015-03-01  2015-04-01
0           2          42           7          15          18
1           7          44          33          38           9
2           3           6          43           9           3
3           8           3          77           3           9
4           1           7          79           2          10
Since these time series are of different lengths, I would like to pad the shorter ones with zeros so that every series has the same dimensionality.
I have tried the following:
df_collection = pd.DataFrame.from_dict(my_dict, orient='index')
...but this gives the incorrect format.
CodePudding user response:
We can concat the dictionary, then remove the individual DataFrame indexes with droplevel and add an index level for yr_mo with set_index, then unstack the Series to go to wide format. Lastly, some cleanup with rename_axis and reset_index removes the axis names and creates a default range index:
new_df = (
    pd.concat(my_dict)
    .droplevel(1).set_index('yr_mo', append=True)['number']
    .unstack(fill_value=0)
    .rename_axis(columns=None)
    .reset_index(drop=True)
)
new_df:
   2014-12-01  2015-01-01  2015-02-01  2015-03-01  2015-04-01
0           0          29          19          38          39
1          28          41           5          25           0
2           3          21           0          35          13
3           7          26          33          23           0
4          10          47          39           0          17
Reproducible random sample data:
import pandas as pd
from numpy.random import Generator, MT19937

bit_gen = MT19937(15)
rng = Generator(bit_gen)
r = list(range(5))
my_dict = dict(zip(r, [pd.DataFrame({
    'yr_mo': ['2014-12-01', '2015-01-01', '2015-02-01', '2015-03-01',
              '2015-04-01'],
    'number': rng.integers(1, 50, 5)
}).sample(frac=0.9, random_state=bit_gen, ignore_index=True) for _ in r]))
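To see what each link in that chain contributes, here is a minimal sketch that runs the same steps one at a time. The names demo_dict, ts_a, ts_b, stacked, indexed and wide are made up for illustration and are not part of the answer; the second series deliberately lacks 2015-02-01 so the zero padding is visible.

```python
import pandas as pd

# Hypothetical two-entry dict; ts_b has no row for 2015-02-01.
demo_dict = {
    "ts_a": pd.DataFrame({"yr_mo": ["2014-12-01", "2015-01-01", "2015-02-01"],
                          "number": [2, 42, 7]}),
    "ts_b": pd.DataFrame({"yr_mo": ["2014-12-01", "2015-01-01"],
                          "number": [5, 9]}),
}

# Step 1: concat stacks the frames; the dict keys become the outer index level
# and each frame's own 0..n-1 index becomes the inner level.
stacked = pd.concat(demo_dict)

# Step 2: drop the inner (per-frame) level, then append yr_mo as a new level,
# leaving a Series indexed by (dict key, yr_mo).
indexed = stacked.droplevel(1).set_index("yr_mo", append=True)["number"]

# Step 3: move the innermost level (yr_mo) into the columns; any
# (key, date) pair that never occurred is filled with 0.
wide = indexed.unstack(fill_value=0)

print(wide)
```

At this point the rows are still labelled by dict key and the column axis is still named yr_mo, which is what the rename_axis and reset_index calls in the answer tidy up.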
CodePudding user response:
ts_gen = (ts.set_index('yr_mo') for ts in my_dict.values())
res = (
    pd.concat(ts_gen, axis=1,      # concatenate the time series as columns (dates are the index/rows)
              ignore_index=True)   # via index alignment ('yr_mo')
    .T                             # transpose so the dates are columns
    .fillna(0)                     # fill the missing date values with zero
    .rename_axis(None, axis=1)     # remove the name of the column axis (which was 'yr_mo')
)
Sample data and output:
import pandas as pd
from pandas import Timestamp

df1 = pd.DataFrame({'yr_mo': [Timestamp('2014-12-01 00:00:00'),
                              Timestamp('2015-01-01 00:00:00'),
                              Timestamp('2015-02-01 00:00:00'),
                              Timestamp('2015-03-01 00:00:00'),
                              Timestamp('2015-04-01 00:00:00')],
                    'number': [2, 42, 7, 15, 18]})
df2 = pd.DataFrame({'yr_mo': [Timestamp('2014-12-01 00:00:00'),
                              Timestamp('2015-01-01 00:00:00'),
                              Timestamp('2015-02-01 00:00:00'),
                              Timestamp('2015-03-01 00:00:00'),
                              Timestamp('2015-04-01 00:00:00'),
                              Timestamp("2015-05-01")],
                    'number': [2, 4, 2, 10, 18, 20]})
my_dict = {"ts_1": df1, "ts_2": df2}
>>> df1
       yr_mo  number
0 2014-12-01       2
1 2015-01-01      42
2 2015-02-01       7
3 2015-03-01      15
4 2015-04-01      18
>>> df2
       yr_mo  number
0 2014-12-01       2
1 2015-01-01       4
2 2015-02-01       2
3 2015-03-01      10
4 2015-04-01      18
5 2015-05-01      20   # only different/extra date from df1
>>> res
   2014-12-01  2015-01-01  2015-02-01  2015-03-01  2015-04-01  2015-05-01
0         2.0        42.0         7.0        15.0        18.0         0.0
1         2.0         4.0         2.0        10.0        18.0        20.0
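The values in res come out as floats because the missing dates are NaN until fillna runs. If integer counts are wanted, one option is to cast after filling. The sketch below is a variation on the answer rather than part of it, and the names df_a, df_b, demo_dict and res_int are hypothetical; it assumes every value is a whole number.

```python
import pandas as pd

# Hypothetical data: the second series has one extra date.
df_a = pd.DataFrame({"yr_mo": pd.to_datetime(["2014-12-01", "2015-01-01"]),
                     "number": [2, 42]})
df_b = pd.DataFrame({"yr_mo": pd.to_datetime(["2014-12-01", "2015-01-01",
                                              "2015-02-01"]),
                     "number": [2, 4, 20]})
demo_dict = {"ts_a": df_a, "ts_b": df_b}

res_int = (
    pd.concat((ts.set_index("yr_mo") for ts in demo_dict.values()),
              axis=1, ignore_index=True)  # align on yr_mo, one column per series
    .T                                    # dates become the columns
    .fillna(0)                            # pad the missing dates with zero
    .astype(int)                          # NaN forced floats; cast back once filled
    .rename_axis(None, axis=1)            # drop the 'yr_mo' name on the column axis
)

print(res_int)
```

The cast has to come after fillna, since NaN cannot be converted to a plain integer dtype.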