Need help in filling gaps for missing beginning date of a month in df_1 (for example: 01, 02, 05, and 07 to 11), I need to have a continuous months (i.e. 12) .
In: df_1 = pd.DataFrame([['2021-03-01', 'Supp_1', 'Product_1', '1'],
['2021-04-01', 'Supp_1', 'Product_1', 1],
['2021-06-01','Supp_1', 'Product_1', 1],
['2021-12-01', 'Supp_1', 'Product_1', 1.25]],
columns=['Date','Supplier','Product','Cost'])
Out:
Date Supplier Product Cost
0 2021-03-01 Supp_1 Product_1 1
1 2021-04-01 Supp_1 Product_1 1
2 2021-06-01 Supp_1 Product_1 1
3 2021-12-01 Supp_1 Product_1 1.25
Expected result is,
Date Supplier Product Cost
0 2021-01-01 Supp_1 Product_1 1
1 2021-02-01 Supp_1 Product_1 1
2 2021-03-01 Supp_1 Product_1
3 2021-04-01 Supp_1 Product_1
4 2021-05-01 Supp_1 Product_1
5 2021-06-01 Supp_1 Product_1 1
6 2021-07-01 Supp_1 Product_1
7 2021-08-01 Supp_1 Product_1
8 2021-09-01 Supp_1 Product_1
9 2021-10-01 Supp_1 Product_1
10 2021-11-01 Supp_1 Product_1
11 2021-12-01 Supp_1 Product_1 1.25
Once we have the df_2, then I can use ffill() and bfill() to fill the gaps for 'Cost'
CodePudding user response:
You can use resample
:
print (df_1.assign(Date=pd.to_datetime(df_1["Date"]))
.set_index("Date")
.resample("MS").asfreq()
.reset_index())
Date Supplier Product Cost
0 2021-01-01 Supp_1 Product_1 1
1 2021-02-01 Supp_1 Product_1 1
2 2021-03-01 NaN NaN NaN
3 2021-04-01 NaN NaN NaN
4 2021-05-01 NaN NaN NaN
5 2021-06-01 Supp_1 Product_1 1
6 2021-07-01 NaN NaN NaN
7 2021-08-01 NaN NaN NaN
8 2021-09-01 NaN NaN NaN
9 2021-10-01 NaN NaN NaN
10 2021-11-01 NaN NaN NaN
11 2021-12-01 Supp_1 Product_1 1.25
CodePudding user response:
You can use this pipeline. The key step is to set_index
Date and use asfreq
:
(df1.assign(Date=pd.to_datetime(df1['Date']))
.set_index('Date')
.asfreq('MS')
.assign(Supplier=lambda d: d['Supplier'].ffill(),
Product=lambda d: d['Product'].ffill()
)
.reset_index()
)
output:
Date Supplier Product Cost
0 2021-01-01 Supp_1 Product_1 1
1 2021-02-01 Supp_1 Product_1 1
2 2021-03-01 Supp_1 Product_1 NaN
3 2021-04-01 Supp_1 Product_1 NaN
4 2021-05-01 Supp_1 Product_1 NaN
5 2021-06-01 Supp_1 Product_1 1
6 2021-07-01 Supp_1 Product_1 NaN
7 2021-08-01 Supp_1 Product_1 NaN
8 2021-09-01 Supp_1 Product_1 NaN
9 2021-10-01 Supp_1 Product_1 NaN
10 2021-11-01 Supp_1 Product_1 NaN
11 2021-12-01 Supp_1 Product_1 1.25
CodePudding user response:
Another option:
df_1.Date = pd.to_datetime(df_1.Date)
df_1 = df_1.set_index('Date').asfreq('MS').reset_index()
df_1
Date Supplier Product Cost
0 2021-01-01 Supp_1 Product_1 1
1 2021-01-02 NaN NaN NaN
2 2021-01-03 NaN NaN NaN
3 2021-01-04 NaN NaN NaN
4 2021-01-05 NaN NaN NaN
... ... ... ... ...
330 2021-11-27 NaN NaN NaN
331 2021-11-28 NaN NaN NaN
332 2021-11-29 NaN NaN NaN
333 2021-11-30 NaN NaN NaN
334 2021-12-01 Supp_1 Product_1 1.25
335 rows × 4 columns
CodePudding user response:
If I get you correctly, you want to expose the missing rows, based on a combination of all the months in the year, along with Supplier
and Product
, and then forward/backward fill on the Cost
column.
Maybe the complete function from pyjanitor could help:
# pip git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
year = df.Date.dt.year.at[0]
months = pd.date_range(f"{year}-01-01", f"{year}-12-01", freq="MS")
months = dict(Date = months)
df.complete(months, 'Supplier', 'Product', sort = True)
Date Supplier Product Cost
0 2021-01-01 Supp_1 Product_1 NaN
1 2021-02-01 Supp_1 Product_1 NaN
2 2021-03-01 Supp_1 Product_1 1
3 2021-04-01 Supp_1 Product_1 1
4 2021-05-01 Supp_1 Product_1 NaN
5 2021-06-01 Supp_1 Product_1 1
6 2021-07-01 Supp_1 Product_1 NaN
7 2021-08-01 Supp_1 Product_1 NaN
8 2021-09-01 Supp_1 Product_1 NaN
9 2021-10-01 Supp_1 Product_1 NaN
10 2021-11-01 Supp_1 Product_1 NaN
11 2021-12-01 Supp_1 Product_1 1.25
You can then fill up or down on the Cost
column.