Home > Mobile >  Add missing months for each unique id
Add missing months for each unique id

Time:05-17

I am trying to add missing month for each unique id in pandas by filling other columns with previous row values. I have seen this one Pandas: Add data for missing months, but it's for only one id(it doesn't work for multiple id), how can we do the same thing if we have multiple id? For instance I have a data like this

import numpy as np
import pandas as pd
index = [0,1,2,3,4]
id = pd.Series([1,1,1,2,2],index= index)
price = pd.Series([20,41,61,68,10],index= index)
date_month = pd.Series(['2021-06','2021-11','2022-02','2021-04','2021-07'],index= index)
df = pd.DataFrame(id,columns = ["id"])
df["price"] =price
df['date_month'] = date_month
df

But I want the output be like

index = [0,1,2, 3,4,5, 6, 7,8,9,10,11,12]
id = pd.Series([1,1,1,1, 1,1,1,1,1, 2,2,2,2],index= index)
price = pd.Series([20,20,20,20,20, 41,41,41, 61,68,68,68,10],index= index)
date_month = pd.Series(['2021-06', '2021-07','2021-08', '2021-09', '2021-10','2021-11','2021-12', '2022-01', '2022-02','2021-04', '2021-05', 
                        '2021-06','2021-07'],index= index)
df = pd.DataFrame(id,columns = ["id"])
df["price"] =price
df['date_month'] = date_month
df

CodePudding user response:

You can try

out = (df.groupby('id')
       .apply(lambda g: (g.set_index('date_month')
                         .reindex(pd.date_range(start=g['date_month'].min(),
                                                end=g['date_month'].max(), freq='MS').strftime('%Y-%m'))
                         .rename_axis('date_month')
                         .reset_index()
                         .fillna(method='ffill')
                         .astype({'id': int, 'price': int})))
       .reset_index(drop=True))
print(out)

   date_month  id  price
0     2021-06   1     20
1     2021-07   1     20
2     2021-08   1     20
3     2021-09   1     20
4     2021-10   1     20
5     2021-11   1     41
6     2021-12   1     41
7     2022-01   1     41
8     2022-02   1     61
9     2021-04   2     68
10    2021-05   2     68
11    2021-06   2     68
12    2021-07   2     10

CodePudding user response:

One option is the complete function from pyjanitor, to expose the missing rows:

# pip install pyjanitor
import pandas as pd
import janitor as jn

Convert to datetime format:

df['date_month'] = pd.to_datetime(df['date_month'])

Build a new range of dates with the min and max:

new_dates = {'date_month' : lambda df: pd.date_range(df.min(), df.max(), freq='MS')}

Build the new dataframe with new_dates and fill in the nulls for price:

(df
.complete(new_dates, by = 'id')
.assign(price = lambda df: df.price.ffill(downcast = 'infer'))
)

    id  price date_month
0    1     20 2021-06-01
1    1     20 2021-07-01
2    1     20 2021-08-01
3    1     20 2021-09-01
4    1     20 2021-10-01
5    1     41 2021-11-01
6    1     41 2021-12-01
7    1     41 2022-01-01
8    1     61 2022-02-01
9    2     68 2021-04-01
10   2     68 2021-05-01
11   2     68 2021-06-01
12   2     10 2021-07-01
  • Related