Home > Net >  Group the dataframe based on their ids and stick the values of ids to each other with the same value
Group the dataframe based on their ids and stick the values of ids to each other with the same value

Time:04-16

I have a data frame, I want to group the data based on the ids and then stick the values for each id together. However, since the number of the ids are not equal, I have to fill the missing values for those ids with same values of the day before (or the day after) of those ids. I previously ask this question and I got a response, however it is only for the day before. But, in my real data, since some times I don't have the first day, their values are Nan.

I have provided an example as below, for example, for id=1, id=2 there is two dates, but for id=3, there are three days.

df = pd.DataFrame()
df['id'] = [1, 1, 2,2, 3, 3, 3]
df['date'] = ['2019-01-02', '2019-01-03', '2019-01-01','2019-01-02', '2019-01-01',      '2019-01-02','2019-01-03']
df['val1']  = [10, 100, 20, 30, 40, 50, 60]
df['val2']  = [30, 30, -20, -30, -40,-50, -60 ]
df['val3']  = [50, 10, 120, 300, 140, 150, 160]

The desired output is:

    val0  val1   val2  val3  val4   val5   val6  val7   val8
id                                                          
1   10.0  30.0   50.0  10.0  30.0   50.0  100.0  30.0   10.0
2   20.0 -20.0  120.0  30.0 -30.0  300.0   30.0 -30.0  300.0
3   40.0 -40.0  140.0  50.0 -50.0  150.0   60.0 -60.0  160.0

Here is the response for the case where it is for the day before.

mi = pd.MultiIndex.from_product([df['id'].unique(), df['date'].unique()], names=['id', 'date'])
df = df.set_index(['id', 'date']).reindex(mi).sort_index().ffill()
df = df.groupby(level='id').apply(np.ravel).apply(pd.Series).add_prefix('val')

If you run the code, you can see that the first values for the id=1 are Nan. And also I don't want to sort the id.

Can anybody help me with that? Thanks

CodePudding user response:

One option is to pivot the DataFrame, then use groupby across columns to fill in the missing values:

val_cols = [c for c in df.columns if c.startswith('val')]
out = (df.pivot('id', 'date', val_cols)
       .groupby(level=0, axis=1).apply(lambda x: x.ffill(axis=1).bfill(axis=1))
       .sort_index(axis=1, level=1))
out.columns = [f'val{i}' for i in range(out.shape[1])]

Output:

    val0  val1   val2  val3  val4   val5   val6  val7   val8
id                                                          
1   10.0  30.0   50.0  10.0  30.0   50.0  100.0  30.0   10.0
2   20.0 -20.0  120.0  30.0 -30.0  300.0   30.0 -30.0  300.0
3   40.0 -40.0  140.0  50.0 -50.0  150.0   60.0 -60.0  160.0
  • Related