I am trying to add missing month for each unique id in pandas by filling other columns with previous row values. I have seen this one Pandas: Add data for missing months, but it's for only one id(it doesn't work for multiple id), how can we do the same thing if we have multiple id? For instance I have a data like this
import numpy as np
import pandas as pd
index = [0,1,2,3,4]
id = pd.Series([1,1,1,2,2],index= index)
price = pd.Series([20,41,61,68,10],index= index)
date_month = pd.Series(['2021-06','2021-11','2022-02','2021-04','2021-07'],index= index)
df = pd.DataFrame(id,columns = ["id"])
df["price"] =price
df['date_month'] = date_month
df
But I want the output be like
index = [0,1,2, 3,4,5, 6, 7,8,9,10,11,12]
id = pd.Series([1,1,1,1, 1,1,1,1,1, 2,2,2,2],index= index)
price = pd.Series([20,20,20,20,20, 41,41,41, 61,68,68,68,10],index= index)
date_month = pd.Series(['2021-06', '2021-07','2021-08', '2021-09', '2021-10','2021-11','2021-12', '2022-01', '2022-02','2021-04', '2021-05',
'2021-06','2021-07'],index= index)
df = pd.DataFrame(id,columns = ["id"])
df["price"] =price
df['date_month'] = date_month
df
CodePudding user response:
You can try
out = (df.groupby('id')
.apply(lambda g: (g.set_index('date_month')
.reindex(pd.date_range(start=g['date_month'].min(),
end=g['date_month'].max(), freq='MS').strftime('%Y-%m'))
.rename_axis('date_month')
.reset_index()
.fillna(method='ffill')
.astype({'id': int, 'price': int})))
.reset_index(drop=True))
print(out)
date_month id price
0 2021-06 1 20
1 2021-07 1 20
2 2021-08 1 20
3 2021-09 1 20
4 2021-10 1 20
5 2021-11 1 41
6 2021-12 1 41
7 2022-01 1 41
8 2022-02 1 61
9 2021-04 2 68
10 2021-05 2 68
11 2021-06 2 68
12 2021-07 2 10
CodePudding user response:
One option is the complete function from pyjanitor, to expose the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor as jn
Convert to datetime format:
df['date_month'] = pd.to_datetime(df['date_month'])
Build a new range of dates with the min and max:
new_dates = {'date_month' : lambda df: pd.date_range(df.min(), df.max(), freq='MS')}
Build the new dataframe with new_dates
and fill in the nulls for price
:
(df
.complete(new_dates, by = 'id')
.assign(price = lambda df: df.price.ffill(downcast = 'infer'))
)
id price date_month
0 1 20 2021-06-01
1 1 20 2021-07-01
2 1 20 2021-08-01
3 1 20 2021-09-01
4 1 20 2021-10-01
5 1 41 2021-11-01
6 1 41 2021-12-01
7 1 41 2022-01-01
8 1 61 2022-02-01
9 2 68 2021-04-01
10 2 68 2021-05-01
11 2 68 2021-06-01
12 2 10 2021-07-01