Fill missing dates with values from previous row per group with duplicated entries

I have the following dataframe (sample):

import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data = data, columns = ['group', 'date', 'value'])

df['date'] = pd.to_datetime(df['date'])
df['diff_days'] = (df['date']-df['date'].groupby(df['group']).transform('first')).dt.days

   group       date  value  diff_days
0      A 2022-09-01      2          0
1      A 2022-09-02      1          1
2      A 2022-09-04      3          3
3      A 2022-09-06      2          5
4      A 2022-09-07      1          6
5      A 2022-09-07      2          6
6      A 2022-09-08      4          7
7      A 2022-09-09      2          8
8      B 2022-09-01      2          0
9      B 2022-09-03      4          2
10     B 2022-09-04      2          3
11     B 2022-09-05      2          4
12     B 2022-09-07      1          6
13     B 2022-09-08      3          7
14     B 2022-09-10      2          9

I would like to fill in the missing dates with the values from the previous date per group. The code from this answer would normally do that, but my data can contain duplicated dates within a group, so it fails with the error shown after the code:

import numpy as np

df['date'] = pd.to_datetime(df['date'])

df = df.set_index(
    ['date', 'group']
).unstack(
    fill_value=-999
).asfreq(
    'D', fill_value=-999
).stack().sort_index(level=1).reset_index()

df.replace(-999, np.nan).ffill()

ValueError: Index contains duplicate entries, cannot reshape
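
The error is raised because unstack needs every (date, group) combination in the index to be unique. A minimal sketch for locating the offending rows before reshaping, assuming the sample data from the question (the name dupes is only illustrative):

```python
import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data, columns=['group', 'date', 'value'])
df['date'] = pd.to_datetime(df['date'])

# keep=False marks every member of a repeated (group, date) pair,
# not only the later occurrences.
dupes = df[df.duplicated(['group', 'date'], keep=False)]
print(dupes)
```

Here only the two rows for group A on 2022-09-07 are reported, which is the pair that blocks the reshape.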

My desired output should look like this:

data = [['A', '2022-09-01', 2, 0], ['A', '2022-09-02', 1, 1], ['A', '2022-09-03', 1, 1], ['A', '2022-09-04', 3, 3], 
        ['A', '2022-09-05', 3, 3], ['A', '2022-09-06', 2, 5], ['A', '2022-09-07', 1, 6], ['A', '2022-09-07', 2, 6], 
        ['A', '2022-09-08', 4, 7], ['A', '2022-09-09', 2, 8],
        ['B', '2022-09-01', 2, 0], ['B', '2022-09-02', 2, 0], ['B', '2022-09-03', 4, 2], ['B', '2022-09-04', 2, 3], 
        ['B', '2022-09-05', 2, 4], ['B', '2022-09-06', 2, 4], ['B', '2022-09-07', 1, 6], ['B', '2022-09-08', 3, 7], 
        ['B', '2022-09-09', 3, 7], ['B', '2022-09-10', 2, 9]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date', 'value', 'diff_days'])

   group        date  value  diff_days
0      A  2022-09-01      2          0
1      A  2022-09-02      1          1
2      A  2022-09-03      1          1
3      A  2022-09-04      3          3
4      A  2022-09-05      3          3
5      A  2022-09-06      2          5
6      A  2022-09-07      1          6
7      A  2022-09-07      2          6
8      A  2022-09-08      4          7
9      A  2022-09-09      2          8
10     B  2022-09-01      2          0
11     B  2022-09-02      2          0
12     B  2022-09-03      4          2
13     B  2022-09-04      2          3
14     B  2022-09-05      2          4
15     B  2022-09-06      2          4
16     B  2022-09-07      1          6
17     B  2022-09-08      3          7
18     B  2022-09-09      3          7
19     B  2022-09-10      2          9

Some dates explained:

For group A the dates "2022-09-03" and "2022-09-05" are missing. As you can see these get the values from the previous date.
For group B the dates "2022-09-02", "2022-09-06" and "2022-09-09" are missing. These also get the values from the previous row.
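
The missing dates listed above can be derived programmatically. The following is a small sketch, assuming each group should span from its own first date to its own last date; the helper missing_dates is a hypothetical name introduced only for this illustration:

```python
import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data, columns=['group', 'date', 'value'])
df['date'] = pd.to_datetime(df['date'])


def missing_dates(dates):
    """Return the calendar days absent between the first and last date."""
    full = pd.date_range(dates.min(), dates.max(), freq='D')
    # isin tolerates repeated dates, so duplicates need no special handling.
    return full[~full.isin(dates)].strftime('%Y-%m-%d').tolist()


missing = {name: missing_dates(dates) for name, dates in df.groupby('group')['date']}
print(missing)
# {'A': ['2022-09-03', '2022-09-05'], 'B': ['2022-09-02', '2022-09-06', '2022-09-09']}
```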

Does anyone know how to fill these missing dates with pandas when a group can contain duplicated dates?

CodePudding user response:

You can make the duplicated dates unique with a helper column that numbers the repeats within each (group, date) pair, then pivot and forward fill:

(df.assign(n=df.groupby(['group', 'date']).cumcount())
   .pivot(index=['date', 'n'], columns='group')
   .ffill()
   .stack().reset_index()
   .sort_values(by=['group', 'date'], ignore_index=True)
   [df.columns]
)

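Output (values are floats because the pivot introduces NaN before the fill; rows 10 and 18 do not appear in the desired output, since the forward fill runs over the dates and duplicate slots of both groups):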

   group       date  value  diff_days
0      A 2022-09-01    2.0        0.0
1      A 2022-09-02    1.0        1.0
2      A 2022-09-03    1.0        1.0
3      A 2022-09-04    3.0        3.0
4      A 2022-09-05    3.0        3.0
5      A 2022-09-06    2.0        5.0
6      A 2022-09-07    1.0        6.0
7      A 2022-09-07    2.0        6.0
8      A 2022-09-08    4.0        7.0
9      A 2022-09-09    2.0        8.0
10     A 2022-09-10    2.0        8.0
11     B 2022-09-01    2.0        0.0
12     B 2022-09-02    2.0        0.0
13     B 2022-09-03    4.0        2.0
14     B 2022-09-04    2.0        3.0
15     B 2022-09-05    2.0        4.0
16     B 2022-09-06    2.0        4.0
17     B 2022-09-07    1.0        6.0
18     B 2022-09-07    1.0        6.0
19     B 2022-09-08    3.0        7.0
20     B 2022-09-09    3.0        7.0
21     B 2022-09-10    2.0        9.0
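
To see why the helper column removes the reshape error, it may help to inspect it on its own. This is a minimal sketch on the question's sample data; the name keyed is illustrative and not part of the answer:

```python
import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data, columns=['group', 'date', 'value'])
df['date'] = pd.to_datetime(df['date'])

# cumcount numbers the rows inside each (group, date) pair: 0 for the first
# occurrence, 1 for the second, and so on.
n = df.groupby(['group', 'date']).cumcount()
keyed = df.assign(n=n)

# Only the second row for group A on 2022-09-07 receives n == 1.
print(keyed[keyed['n'] > 0])

# With n included, every (date, n, group) key is unique, so pivot can reshape.
print(keyed.duplicated(['date', 'n', 'group']).any())  # False
```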

CodePudding user response:

Solution

c = ['group', 'date']
m = df[c].duplicated(keep='last')

s = df[~m].set_index('date').groupby('group').resample('D').ffill()
out = pd.concat([df[m], s.droplevel(0).reset_index()]).sort_values(c)

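How does this work?

Identify the rows whose (group, date) pair occurs again later (keep='last' flags every occurrence except the last one)
Set those rows aside and resample the remaining rows per group to daily frequency with forward fill
Concatenate the flagged rows with the resampled rows and sort by group and date to get the result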
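
The steps above can be sketched with each intermediate kept separate. This is a variant, not the answer's exact code: it assumes the question's sample data and resamples one group at a time in a plain loop, which avoids depending on how groupby(...).resample(...) treats the grouping column in a given pandas version. The name parts is illustrative.

```python
import pandas as pd

data = [['A', '2022-09-01', 2], ['A', '2022-09-02', 1], ['A', '2022-09-04', 3], ['A', '2022-09-06', 2],
        ['A', '2022-09-07', 1], ['A', '2022-09-07', 2], ['A', '2022-09-08', 4], ['A', '2022-09-09', 2],
        ['B', '2022-09-01', 2], ['B', '2022-09-03', 4], ['B', '2022-09-04', 2], ['B', '2022-09-05', 2],
        ['B', '2022-09-07', 1], ['B', '2022-09-08', 3], ['B', '2022-09-10', 2]]
df = pd.DataFrame(data, columns=['group', 'date', 'value'])
df['date'] = pd.to_datetime(df['date'])
df['diff_days'] = (df['date'] - df.groupby('group')['date'].transform('first')).dt.days

c = ['group', 'date']

# Step 1: flag rows that have a later row with the same (group, date).
m = df[c].duplicated(keep='last')

# Step 2: resample the remaining rows to daily frequency, one group at a
# time, carrying the previous row forward into the new dates.
parts = [
    g.set_index('date').resample('D').ffill().reset_index()
    for _, g in df[~m].groupby('group')
]

# Step 3: put the flagged rows back and restore the group/date order.
out = pd.concat([df[m], *parts]).sort_values(c, ignore_index=True)[df.columns]
print(out)
```

Passing ignore_index=True renumbers the rows from 0 to 19; without it the original row labels are kept, so some labels repeat.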

Result

   group       date  value  diff_days
0      A 2022-09-01      2          0
1      A 2022-09-02      1          1
2      A 2022-09-03      1          1
3      A 2022-09-04      3          3
4      A 2022-09-05      3          3
5      A 2022-09-06      2          5
4      A 2022-09-07      1          6
6      A 2022-09-07      2          6
7      A 2022-09-08      4          7
8      A 2022-09-09      2          8
9      B 2022-09-01      2          0
10     B 2022-09-02      2          0
11     B 2022-09-03      4          2
12     B 2022-09-04      2          3
13     B 2022-09-05      2          4
14     B 2022-09-06      2          4
15     B 2022-09-07      1          6
16     B 2022-09-08      3          7
17     B 2022-09-09      3          7
18     B 2022-09-10      2          9