fill missing date column in python dataframe-CodePudding

Need help in filling gaps for missing beginning date of a month in df_1 (for example: 01, 02, 05, and 07 to 11), I need to have a continuous months (i.e. 12) .

In: df_1 = pd.DataFrame([['2021-03-01', 'Supp_1', 'Product_1', '1'],
                 ['2021-04-01', 'Supp_1', 'Product_1', 1],
                 ['2021-06-01','Supp_1', 'Product_1', 1],
                 ['2021-12-01', 'Supp_1', 'Product_1', 1.25]],
                columns=['Date','Supplier','Product','Cost'])
Out: 

       Date     Supplier    Product     Cost
0   2021-03-01  Supp_1  Product_1   1
1   2021-04-01  Supp_1  Product_1   1
2   2021-06-01  Supp_1  Product_1   1
3   2021-12-01  Supp_1  Product_1   1.25

Expected result is,

      Date     Supplier Product     Cost
0   2021-01-01  Supp_1  Product_1   1
1   2021-02-01  Supp_1  Product_1   1
2   2021-03-01  Supp_1  Product_1   
3   2021-04-01  Supp_1  Product_1   
4   2021-05-01  Supp_1  Product_1   
5   2021-06-01  Supp_1  Product_1   1
6   2021-07-01  Supp_1  Product_1   
7   2021-08-01  Supp_1  Product_1   
8   2021-09-01  Supp_1  Product_1   
9   2021-10-01  Supp_1  Product_1   
10  2021-11-01  Supp_1  Product_1   
11  2021-12-01  Supp_1  Product_1   1.25

Once we have the df_2, then I can use ffill() and bfill() to fill the gaps for 'Cost'

CodePudding user response：

You can use resample:

print (df_1.assign(Date=pd.to_datetime(df_1["Date"]))
           .set_index("Date")
           .resample("MS").asfreq()
           .reset_index())

         Date Supplier    Product  Cost
0  2021-01-01   Supp_1  Product_1     1
1  2021-02-01   Supp_1  Product_1     1
2  2021-03-01      NaN        NaN   NaN
3  2021-04-01      NaN        NaN   NaN
4  2021-05-01      NaN        NaN   NaN
5  2021-06-01   Supp_1  Product_1     1
6  2021-07-01      NaN        NaN   NaN
7  2021-08-01      NaN        NaN   NaN
8  2021-09-01      NaN        NaN   NaN
9  2021-10-01      NaN        NaN   NaN
10 2021-11-01      NaN        NaN   NaN
11 2021-12-01   Supp_1  Product_1  1.25

CodePudding user response：

You can use this pipeline. The key step is to set_index Date and use asfreq:

(df1.assign(Date=pd.to_datetime(df1['Date']))
    .set_index('Date')
    .asfreq('MS')
    .assign(Supplier=lambda d: d['Supplier'].ffill(),
            Product=lambda d: d['Product'].ffill()
           )
    .reset_index()
)

output:

         Date Supplier    Product  Cost
0  2021-01-01   Supp_1  Product_1     1
1  2021-02-01   Supp_1  Product_1     1
2  2021-03-01   Supp_1  Product_1   NaN
3  2021-04-01   Supp_1  Product_1   NaN
4  2021-05-01   Supp_1  Product_1   NaN
5  2021-06-01   Supp_1  Product_1     1
6  2021-07-01   Supp_1  Product_1   NaN
7  2021-08-01   Supp_1  Product_1   NaN
8  2021-09-01   Supp_1  Product_1   NaN
9  2021-10-01   Supp_1  Product_1   NaN
10 2021-11-01   Supp_1  Product_1   NaN
11 2021-12-01   Supp_1  Product_1  1.25

CodePudding user response：

Another option:

df_1.Date = pd.to_datetime(df_1.Date)
df_1 = df_1.set_index('Date').asfreq('MS').reset_index()
df_1

    Date    Supplier    Product Cost
0   2021-01-01  Supp_1  Product_1   1
1   2021-01-02  NaN NaN NaN
2   2021-01-03  NaN NaN NaN
3   2021-01-04  NaN NaN NaN
4   2021-01-05  NaN NaN NaN
... ... ... ... ...
330 2021-11-27  NaN NaN NaN
331 2021-11-28  NaN NaN NaN
332 2021-11-29  NaN NaN NaN
333 2021-11-30  NaN NaN NaN
334 2021-12-01  Supp_1  Product_1   1.25
335 rows × 4 columns

CodePudding user response：

If I get you correctly, you want to expose the missing rows, based on a combination of all the months in the year, along with Supplier and Product, and then forward/backward fill on the Cost column.

Maybe the complete function from pyjanitor could help:

# pip git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn

year = df.Date.dt.year.at[0]
months = pd.date_range(f"{year}-01-01", f"{year}-12-01", freq="MS")
months = dict(Date = months)

 df.complete(months, 'Supplier', 'Product', sort = True)
 
         Date Supplier    Product  Cost
0  2021-01-01   Supp_1  Product_1   NaN
1  2021-02-01   Supp_1  Product_1   NaN
2  2021-03-01   Supp_1  Product_1     1
3  2021-04-01   Supp_1  Product_1     1
4  2021-05-01   Supp_1  Product_1   NaN
5  2021-06-01   Supp_1  Product_1     1
6  2021-07-01   Supp_1  Product_1   NaN
7  2021-08-01   Supp_1  Product_1   NaN
8  2021-09-01   Supp_1  Product_1   NaN
9  2021-10-01   Supp_1  Product_1   NaN
10 2021-11-01   Supp_1  Product_1   NaN
11 2021-12-01   Supp_1  Product_1  1.25

You can then fill up or down on the Cost column.