Home > Software design >  Easier way to fill the missing fields csv using python
Easier way to fill the missing fields csv using python

Time:10-16

I used the groupby method from pandas that can group by id and time in this example csv for example:

| id | month | average tree growth (cm)|
|----|-------|-------------------------|
|  1 |   4   |        9                |
|  1 |   5   |        4                |
|  1 |   6   |        7                |
|  2 |   1   |        9                |
|  2 |   2   |        9                |
|  2 |   3   |        8                |
|  2 |   4   |        6                |

However, each id should have 12 months and I will need to fill in the average tree height at that missing month to be null value, like this:

| id | month | average tree growth (cm)|
|----|-------|-------------------------|
|  1 |   1   |        nan              |
|  1 |   2   |        nan              |
|  1 |   3   |        nan              |
|  1 |   4   |        9                |
|  1 |   5   |        4                |
|  1 |   6   |        7                |
|  1 |   7   |        nan              |
|  1 |   8   |        nan              |
|  1 |   9   |        nan              |
|  1 |   10  |        nan              |
|  1 |   11  |        nan              |
|  1 |   12  |        nan              |
|  2 |   1   |        9                |

This is for bokeh plotting purpose, how do I add the missing month to each id and fill the average height to nan in this case using python? Is there any easier way than brute force looping all id and check for months? Any hint would be appreciated!

CodePudding user response:

One possible solution is the following:

(df.groupby('id')['month']
   .apply(lambda x:np.arange(1, 13))
   .explode()
   .reset_index()
   .merge(df, how='left')
   
)

which produces:

id month  average tree growth (cm)
0    1     1                       NaN
1    1     2                       NaN
2    1     3                       NaN
3    1     4                       9.0
4    1     5                       4.0
5    1     6                       7.0
6    1     7                       NaN
7    1     8                       NaN
8    1     9                       NaN
9    1    10                       NaN
10   1    11                       NaN
11   1    12                       NaN
12   2     1                       9.0
13   2     2                       9.0
14   2     3                       8.0
15   2     4                       6.0
16   2     5                       NaN
17   2     6                       NaN
18   2     7                       NaN
19   2     8                       NaN
20   2     9                       NaN
21   2    10                       NaN
22   2    11                       NaN
23   2    12                       NaN
​

CodePudding user response:

You can use .stack() and .unstack() to reshape month as column, expand to all 12 months by .reindex() and then restore to original shape, as follows:

If you have already grouped by id and month such that id and month are already the row index, you can use:

(df.unstack()
   .stack(level=0)
   .reindex(columns=np.arange(1, 13))
   .stack(dropna=False)
   .unstack(level=1)
   .reset_index()
)

Or, if have not grouped by id and month yet or you used as_index=False in the groupby, such that id and month are not the row index, you can use:

(df.set_index(['id', 'month'])
   .unstack()
   .stack(level=0)
   .reindex(columns=np.arange(1, 13))
   .stack(dropna=False)
   .unstack(level=1)
   .reset_index()
)

Result:

    id  month  average tree growth (cm)
0    1      1                       NaN
1    1      2                       NaN
2    1      3                       NaN
3    1      4                       9.0
4    1      5                       4.0
5    1      6                       7.0
6    1      7                       NaN
7    1      8                       NaN
8    1      9                       NaN
9    1     10                       NaN
10   1     11                       NaN
11   1     12                       NaN
12   2      1                       9.0
13   2      2                       9.0
14   2      3                       8.0
15   2      4                       6.0
16   2      5                       NaN
17   2      6                       NaN
18   2      7                       NaN
19   2      8                       NaN
20   2      9                       NaN
21   2     10                       NaN
22   2     11                       NaN
23   2     12                       NaN
  • Related