Filling missing dates and filling each missing value with the previous date's value


I have panel data. As you can see, there are missing dates. I want the dates to be continuous, with each missing value filled with the previous date's value.

Sometimes I have only one observation per token per month. In that case I want all subsequent days to carry that initial value.
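For reference, a minimal version of this frame can be built as follows (token addresses shortened, only the first few rows of each token included):

```python
import pandas as pd

# Small reproduction of the panel: two tokens, gappy dates
df = pd.DataFrame({
    'token_address': ['0x49', '0x49', '0x49', '0x85', '0x85'],
    'date': ['2021-08-01', '2021-08-02', '2021-08-04',
             '2021-08-01', '2021-08-02'],
    'price1': [0.000214, 0.000214, 0.000166, 0.000374, 0.000382],
    'price2': [0.573024, 0.558163, 0.451243, 0.991719, 1.011735],
})
df['date'] = pd.to_datetime(df['date'])  # parse strings into datetime64
```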

    token_address                               date        price1      price2
0   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-01  0.000214    0.573024
1   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-02  0.000214    0.558163
2   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-04  0.000166    0.451243
3   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-05  0.000137    0.387212
4   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-06  0.000117    0.337081
5   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-08  0.000103    0.311277
6   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-10  0.000067    0.245770
7   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-15  0.000085    0.281036
8   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-17  0.000089    0.267481
9   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-19  0.000089    0.281457
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-22  0.000094    0.305423
11  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-01  0.000374    0.991719
12  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-02  0.000382    1.011735
14  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-04  0.000398    1.073131
15  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-05  0.000369    1.044095

The desired df (please ignore the index, as I constructed this manually):

    token_address                               date        price1      price2
0   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-01  0.000214    0.573024
1   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-02  0.000214    0.558163
2   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-03  0.000214    0.558163
2   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-04  0.000166    0.451243
3   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-05  0.000137    0.387212
4   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-06  0.000117    0.337081
4   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-07  0.000117    0.337081
5   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-08  0.000103    0.311277
5   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-09  0.000103    0.311277
6   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-10  0.000067    0.245770
6   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-11  0.000067    0.245770
6   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-12  0.000067    0.245770
6   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-13  0.000067    0.245770
6   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-14  0.000067    0.245770
7   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-15  0.000085    0.281036
7   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-16  0.000085    0.281036
8   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-17  0.000089    0.267481
8   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-18  0.000089    0.267481
9   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-19  0.000089    0.281457
9   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-20  0.000089    0.281457
9   0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-21  0.000089    0.281457
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-22  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-23  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-24  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-25  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-26  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-27  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-28  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-29  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-30  0.000094    0.305423
10  0x0000000000004946c0e9f43f4dee607b0ef1fa1c  2021-08-31  0.000094    0.305423
11  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-01  0.000374    0.991719
12  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-02  0.000382    1.011735
13  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-03  0.000382    1.000586
14  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-04  0.000398    1.073131
15  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-05  0.000369    1.044095
......
15  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-29  0.000369    1.044095
15  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-30  0.000369    1.044095
15  0x0000000000085d4780b73119b644ae5ecd22b376  2021-08-31  0.000369    1.044095

CodePudding user response:

One option using Pandas only is to build a DataFrame containing every combination of token_address and the dates from 2021-08-01 to 2021-08-31, and merge it with the original dataframe:

df['date'] = pd.to_datetime(df['date'])  # ensure both frames share a datetime dtype

tokens = df['token_address'].unique()
dates = pd.date_range('2021-08-01', '2021-08-31')
names = ['token_address', 'date']
arr = pd.MultiIndex.from_product([tokens, dates], names=names).to_frame(index=False)

out = pd.merge_ordered(
    df,
    arr,
    how='outer',
    on=names,
    fill_method='ffill',
)

                                 token_address       date    price1    price2
0   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-01  0.000214  0.573024
1   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-02  0.000214  0.558163
2   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-03  0.000214  0.558163
3   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-04  0.000166  0.451243
4   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-05  0.000137  0.387212
..                                         ...        ...       ...       ...
57  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-27  0.000369  1.044095
58  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-28  0.000369  1.044095
59  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-29  0.000369  1.044095
60  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-30  0.000369  1.044095
61  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-31  0.000369  1.044095

[62 rows x 4 columns]

Kindly verify that the forward fill works properly; if not, you can use a regular merge and forward fill afterwards.
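A sketch of that fallback, shown on toy data (tokens `'a'`/`'b'` and the prices here are illustrative, not from the question): build the full grid, do a plain left merge, then forward fill per token so values never leak from one token into the next.

```python
import pandas as pd

# Toy frame with a gap for token 'a' and a trailing gap for token 'b'
df = pd.DataFrame({
    'token_address': ['a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2021-08-01', '2021-08-03',
                            '2021-08-01', '2021-08-02']),
    'price1': [1.0, 2.0, 10.0, 11.0],
    'price2': [5.0, 6.0, 50.0, 51.0],
})

tokens = df['token_address'].unique()
dates = pd.date_range('2021-08-01', '2021-08-03')
full = (pd.MultiIndex
          .from_product([tokens, dates], names=['token_address', 'date'])
          .to_frame(index=False))

# Plain left merge against the complete token/date grid
out = full.merge(df, how='left', on=['token_address', 'date'])

# Forward fill within each token only, so fills cannot cross token boundaries
out[['price1', 'price2']] = (out.groupby('token_address')[['price1', 'price2']]
                                .ffill())
```

The group-wise `ffill` is the important part: a frame-level `out.ffill()` would copy token `'a'`'s last price into token `'b'` if `'b'` happened to be missing its first date.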

Another option is complete from pyjanitor - in this case, you pass a dictionary of the new dates to the function, to be applied to every group in token_address - the dictionary key must be an existing column in your dataframe.

# pip install pyjanitor
import janitor
new_dates = {'date' : dates}
df.complete(new_dates, by = 'token_address').ffill()

                                 token_address       date    price1    price2
0   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-01  0.000214  0.573024
1   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-02  0.000214  0.558163
2   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-03  0.000214  0.558163
3   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-04  0.000166  0.451243
4   0x0000000000004946c0e9f43f4dee607b0ef1fa1c 2021-08-05  0.000137  0.387212
..                                         ...        ...       ...       ...
57  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-27  0.000369  1.044095
58  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-28  0.000369  1.044095
59  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-29  0.000369  1.044095
60  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-30  0.000369  1.044095
61  0x0000000000085d4780b73119b644ae5ecd22b376 2021-08-31  0.000369  1.044095

[62 rows x 4 columns]

Again, verify the dates are in the proper order; if not, sort into the preferred order before forward filling price1 and price2.

CodePudding user response:

Here is one way to do it:

# set the index to the date column
# groupby token and resample the index and ffill

df['date'] = pd.to_datetime(df['date'])

(df.set_index('date')
   .groupby('token_address')
   .apply(lambda x: x.reindex(pd.date_range(df['date'].min(),
                                            df['date'].max() + pd.offsets.MonthEnd(0),
                                            freq='D'))
                     .ffill())   # fill inside each group, so values stay per token
)