I have a data frame, I want to group the data based on the ids and then stick the values for each id together. However, since the number of the ids are not equal, I have to fill the missing values for those ids with mean of the day before
.
I have provided an example as below, for example, for id=1
, id=2
there is two dates, but for id=3
, there are three days.
df = pd.DataFrame()
df['id'] = [1, 1, 2,2, 3, 3, 3]
df['date'] = ['2019-01-01', '2019-01-03', '2019-01-01','2019-01-02', '2019-01-01', '2019-01-02','2019-01-03']
df['val1'] = [10, 100, 20, 30, 40, 50, 60]
df['val2'] = [30, 30, -20, -30, -40,-50, -60 ]
df['val3'] = [50, 10, 120, 300, 140, 150, 160]
I have tried with the following code:
DF_sticked = df.filter(regex='val\d ', axis=1).groupby(df['id'])\
.apply(np.ravel).apply(pd.Series).rename(lambda x: f"val{x}", axis=1).reset_index().fillna(0)
But, in the above code, I fill the missing value with 0. And also the missing value are at the end of the columns.
The output that I want is as follow. As you can see, the val 4, val5, val 6
of id=1
is same as the mean of the day before. or the val 7, 8, 9 of id=2
is equal to the mean of the day before (mean of 30, -30, 300)
Thank you so much.
CodePudding user response:
You can actually pivot the dataframe, then then groupy the dates, transform('mean')
, and ffill
:
tmp = df.pivot(index='id', columns='date').swaplevel(axis=1).sort_index(axis=1)
tmp[tmp.isna()] = tmp.groupby(level=0, axis=1).transform('mean').ffill(axis=1)
tmp = tmp.astype(int).droplevel(0, axis=1)
# Fix the columns
tmp = tmp.set_axis(np.arange(1, tmp.shape[1] 1).astype(str), axis=1).add_prefix('val')
Output:
>>> tmp
val1 val2 val3 val4 val5 val6 val7 val8 val9
id
1 10 30 50 30 30 30 100 30 10
2 20 -20 120 30 -30 300 100 100 100
3 40 -40 140 50 -50 150 60 -60 160
CodePudding user response:
IIUC, compute the cartesian product between id
and date
to fill missing dates then use your code:
mi = pd.MultiIndex.from_product([df['id'].unique(), df['date'].unique()], names=['id', 'date'])
df = df.set_index(['id', 'date']).reindex(mi).sort_index().ffill()
# Your code
df = df.groupby(level='id').apply(np.ravel).apply(pd.Series).add_prefix('val')
Output:
>>> df
val0 val1 val2 val3 val4 val5 val6 val7 val8
id
1 10.0 30.0 50.0 10.0 30.0 50.0 100.0 30.0 10.0
2 20.0 -20.0 120.0 30.0 -30.0 300.0 30.0 -30.0 300.0
3 40.0 -40.0 140.0 50.0 -50.0 150.0 60.0 -60.0 160.0