I have sales data for different customers on different dates. But the dates are not continuous and I would like to resample the data to daily frequency. How can I do this?

MWE

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list('aababcbc'),
                  'date': pd.date_range('2022-01-01',periods=8),
                  'value':range(8)}).sort_values('id')


df

id  date    value
0   a   2022-01-01  0
1   a   2022-01-02  1
3   a   2022-01-04  3
2   b   2022-01-03  2
4   b   2022-01-05  4
6   b   2022-01-07  6
5   c   2022-01-06  5
7   c   2022-01-08  7

The required output is following

id  date    value  
a   2022-01-01  0  
a   2022-01-02  1  
a   2022-01-03  0 ** there is no data for a in this day  
a   2022-01-04  3

  
b   2022-01-03  2    
b   2022-01-04  0 ** there is no data for b in this day  
b   2022-01-05  4  
b   2022-01-06  0 ** there is no data for b in this day  
b   2022-01-07  6

  
c   2022-01-06  5  
c   2022-01-07  0 ** there is no data for c in this day
c   2022-01-08  7

My attempt

df.groupby(['id']).resample('D',on='date')['value'].sum().reset_index()

CodePudding user response：

df["date"] = pd.to_datetime(df["date"])
df.set_index("date").groupby("id").resample("1d").sum()

CodePudding user response：

def f(df):
    return df.resample('D', on='date')['value'].sum()
    
df.groupby(['id']).apply(f).reset_index()

produces:

   id       date  value
0   a 2022-01-01      0
1   a 2022-01-02      1
2   a 2022-01-03      0
3   a 2022-01-04      3
4   b 2022-01-03      2
5   b 2022-01-04      0
6   b 2022-01-05      4
7   b 2022-01-06      0
8   b 2022-01-07      6
9   c 2022-01-06      5
10  c 2022-01-07      0
11  c 2022-01-08      7

CodePudding user response：

This is the solution I came up with:

df.groupby(['id']).apply(lambda x: x.resample('D',on='date')['value'].sum()).reset_index()

id  date    value
0   a   2022-01-01  0
1   a   2022-01-02  1
2   a   2022-01-03  0
3   a   2022-01-04  3
4   b   2022-01-03  2
5   b   2022-01-04  0
6   b   2022-01-05  4
7   b   2022-01-06  0
8   b   2022-01-07  6
9   c   2022-01-06  5
10  c   2022-01-07  0
11  c   2022-01-08  7

CodePudding user response：

This might give you some possible help to solve the puzzle.

idx = pd.date_range('2022-01-01', '2022-1-31').to_frame()
df2 = pd.merge(df, idx, how='outer' , left_on='date', right_on=0)

df2

So basically this merge the two frames in a 'outer' way, meaning it returns all possible rows of them.

idx is a DF with all dates you focus your analysis onto, and the NaN will be YOUR the zero values --> ** there is no data for a in this day

Then you can decide to convert, or count or extrapolate the NaN rows according to your needs.