Home > Enterprise >  Group the dataframe based on ids and stick the values of ids to each other with mean of the last day
Group the dataframe based on ids and stick the values of ids to each other with mean of the last day

Time:03-24

I have a data frame, I want to group the data based on the ids and then stick the values for each id together. However, since the number of the ids are not equal, I have to fill the missing values for those ids with mean of the day before.

I have provided an example as below, for example, for id=1, id=2 there is two dates, but for id=3, there are three days.

df = pd.DataFrame()
df['id'] = [1, 1, 2,2, 3, 3, 3]
df['date'] = ['2019-01-01', '2019-01-03', '2019-01-01','2019-01-02', '2019-01-01',   '2019-01-02','2019-01-03']
df['val1']  = [10, 100, 20, 30, 40, 50, 60]
df['val2']  = [30, 30, -20, -30, -40,-50, -60 ]
df['val3']  = [50, 10, 120, 300, 140, 150, 160]

I have tried with the following code:

DF_sticked            = df.filter(regex='val\d ', axis=1).groupby(df['id'])\
                         .apply(np.ravel).apply(pd.Series).rename(lambda x: f"val{x}", axis=1).reset_index().fillna(0)

But, in the above code, I fill the missing value with 0. And also the missing value are at the end of the columns. The output that I want is as follow. As you can see, the val 4, val5, val 6 of id=1 is same as the mean of the day before. or the val 7, 8, 9 of id=2 is equal to the mean of the day before (mean of 30, -30, 300)

enter image description here

Thank you so much.

CodePudding user response:

You can actually pivot the dataframe, then then groupy the dates, transform('mean'), and ffill:

tmp = df.pivot(index='id', columns='date').swaplevel(axis=1).sort_index(axis=1)
tmp[tmp.isna()] = tmp.groupby(level=0, axis=1).transform('mean').ffill(axis=1)
tmp = tmp.astype(int).droplevel(0, axis=1)

# Fix the columns
tmp = tmp.set_axis(np.arange(1, tmp.shape[1]   1).astype(str), axis=1).add_prefix('val')

Output:

>>> tmp
    val1  val2  val3  val4  val5  val6  val7  val8  val9
id                                                      
1     10    30    50    30    30    30   100    30    10
2     20   -20   120    30   -30   300   100   100   100
3     40   -40   140    50   -50   150    60   -60   160

CodePudding user response:

IIUC, compute the cartesian product between id and date to fill missing dates then use your code:

mi = pd.MultiIndex.from_product([df['id'].unique(), df['date'].unique()], names=['id', 'date'])
df = df.set_index(['id', 'date']).reindex(mi).sort_index().ffill()

# Your code
df = df.groupby(level='id').apply(np.ravel).apply(pd.Series).add_prefix('val')

Output:

>>> df
    val0  val1   val2  val3  val4   val5   val6  val7   val8
id                                                          
1   10.0  30.0   50.0  10.0  30.0   50.0  100.0  30.0   10.0
2   20.0 -20.0  120.0  30.0 -30.0  300.0   30.0 -30.0  300.0
3   40.0 -40.0  140.0  50.0 -50.0  150.0   60.0 -60.0  160.0

  • Related