I used the groupby method from pandas that can group by id and time in this example csv for example:
| id | month | average tree growth (cm)|
|----|-------|-------------------------|
| 1 | 4 | 9 |
| 1 | 5 | 4 |
| 1 | 6 | 7 |
| 2 | 1 | 9 |
| 2 | 2 | 9 |
| 2 | 3 | 8 |
| 2 | 4 | 6 |
However, each id should have 12 months and I will need to fill in the average tree height at that missing month to be null value, like this:
| id | month | average tree growth (cm)|
|----|-------|-------------------------|
| 1 | 1 | nan |
| 1 | 2 | nan |
| 1 | 3 | nan |
| 1 | 4 | 9 |
| 1 | 5 | 4 |
| 1 | 6 | 7 |
| 1 | 7 | nan |
| 1 | 8 | nan |
| 1 | 9 | nan |
| 1 | 10 | nan |
| 1 | 11 | nan |
| 1 | 12 | nan |
| 2 | 1 | 9 |
This is for bokeh plotting purpose, how do I add the missing month to each id and fill the average height to nan in this case using python? Is there any easier way than brute force looping all id and check for months? Any hint would be appreciated!
CodePudding user response:
One possible solution is the following:
(df.groupby('id')['month']
.apply(lambda x:np.arange(1, 13))
.explode()
.reset_index()
.merge(df, how='left')
)
which produces:
id month average tree growth (cm)
0 1 1 NaN
1 1 2 NaN
2 1 3 NaN
3 1 4 9.0
4 1 5 4.0
5 1 6 7.0
6 1 7 NaN
7 1 8 NaN
8 1 9 NaN
9 1 10 NaN
10 1 11 NaN
11 1 12 NaN
12 2 1 9.0
13 2 2 9.0
14 2 3 8.0
15 2 4 6.0
16 2 5 NaN
17 2 6 NaN
18 2 7 NaN
19 2 8 NaN
20 2 9 NaN
21 2 10 NaN
22 2 11 NaN
23 2 12 NaN
CodePudding user response:
You can use .stack()
and .unstack()
to reshape month
as column, expand to all 12 months by .reindex()
and then restore to original shape, as follows:
If you have already grouped by id
and month
such that id
and month
are already the row index, you can use:
(df.unstack()
.stack(level=0)
.reindex(columns=np.arange(1, 13))
.stack(dropna=False)
.unstack(level=1)
.reset_index()
)
Or, if have not grouped by id
and month
yet or you used as_index=False
in the groupby, such that id
and month
are not the row index, you can use:
(df.set_index(['id', 'month'])
.unstack()
.stack(level=0)
.reindex(columns=np.arange(1, 13))
.stack(dropna=False)
.unstack(level=1)
.reset_index()
)
Result:
id month average tree growth (cm)
0 1 1 NaN
1 1 2 NaN
2 1 3 NaN
3 1 4 9.0
4 1 5 4.0
5 1 6 7.0
6 1 7 NaN
7 1 8 NaN
8 1 9 NaN
9 1 10 NaN
10 1 11 NaN
11 1 12 NaN
12 2 1 9.0
13 2 2 9.0
14 2 3 8.0
15 2 4 6.0
16 2 5 NaN
17 2 6 NaN
18 2 7 NaN
19 2 8 NaN
20 2 9 NaN
21 2 10 NaN
22 2 11 NaN
23 2 12 NaN