I have sales data for different customers on different dates. But the dates are not continuous and I would like to resample the data to daily frequency. How can I do this?
MWE
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': list('aababcbc'),
'date': pd.date_range('2022-01-01',periods=8),
'value':range(8)}).sort_values('id')
df
id date value
0 a 2022-01-01 0
1 a 2022-01-02 1
3 a 2022-01-04 3
2 b 2022-01-03 2
4 b 2022-01-05 4
6 b 2022-01-07 6
5 c 2022-01-06 5
7 c 2022-01-08 7
The required output is following
id date value
a 2022-01-01 0
a 2022-01-02 1
a 2022-01-03 0 ** there is no data for a in this day
a 2022-01-04 3
b 2022-01-03 2
b 2022-01-04 0 ** there is no data for b in this day
b 2022-01-05 4
b 2022-01-06 0 ** there is no data for b in this day
b 2022-01-07 6
c 2022-01-06 5
c 2022-01-07 0 ** there is no data for c in this day
c 2022-01-08 7
My attempt
df.groupby(['id']).resample('D',on='date')['value'].sum().reset_index()
CodePudding user response:
df["date"] = pd.to_datetime(df["date"])
df.set_index("date").groupby("id").resample("1d").sum()
CodePudding user response:
def f(df):
return df.resample('D', on='date')['value'].sum()
df.groupby(['id']).apply(f).reset_index()
produces:
id date value
0 a 2022-01-01 0
1 a 2022-01-02 1
2 a 2022-01-03 0
3 a 2022-01-04 3
4 b 2022-01-03 2
5 b 2022-01-04 0
6 b 2022-01-05 4
7 b 2022-01-06 0
8 b 2022-01-07 6
9 c 2022-01-06 5
10 c 2022-01-07 0
11 c 2022-01-08 7
CodePudding user response:
This is the solution I came up with:
df.groupby(['id']).apply(lambda x: x.resample('D',on='date')['value'].sum()).reset_index()
id date value
0 a 2022-01-01 0
1 a 2022-01-02 1
2 a 2022-01-03 0
3 a 2022-01-04 3
4 b 2022-01-03 2
5 b 2022-01-04 0
6 b 2022-01-05 4
7 b 2022-01-06 0
8 b 2022-01-07 6
9 c 2022-01-06 5
10 c 2022-01-07 0
11 c 2022-01-08 7
CodePudding user response:
This might give you some possible help to solve the puzzle.
idx = pd.date_range('2022-01-01', '2022-1-31').to_frame()
df2 = pd.merge(df, idx, how='outer' , left_on='date', right_on=0)
df2
So basically this merge the two frames in a 'outer' way, meaning it returns all possible rows of them.
idx
is a DF with all dates you focus your analysis onto, and the NaN will be YOUR the zero values --> ** there is no data for a in this day
Then you can decide to convert, or count or extrapolate the NaN
rows according to your needs.