I'm working in Python 3 with a Pandas DataFrame. It has columns for Category, Date and Value. For each category, I want to add rows with the missing days, such the value is linearly interpolated.
To create the minimal example, I use the following code
df = pd.DataFrame({
'cat':['A', 'A', 'A', 'A', 'B', 'B', 'B'],
'date': ['2021-1-1', '2021-1-4', '2021-1-5', '2021-1-7', '2021-11-1', '2021-11-2', '2021-11-5'],
'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 9.0]})
df['cat'] = df['cat'].astype('category')
df['date'] = df['date'].astype('datetime64')
Which gives the following dataframe
cat date value
A 2021-01-01 1.0
A 2021-01-04 2.0
A 2021-01-05 3.0
A 2021-01-07 4.0
B 2021-11-01 5.0
B 2021-11-02 6.0
B 2021-11-05 9.0
I would like the output to be like this example, Where I used '<' to indicate the newly inserted rows
cat date value
A 2021-01-01 1.0
A 2021-01-02 1.333 <
A 2021-01-03 1.667 <
A 2021-01-04 2.0
A 2021-01-05 3.0
A 2021-01-06 3.5 <
A 2021-01-07 4.0
B 2021-11-01 5.0
B 2021-11-02 6.0
B 2021-11-03 7.0 <
B 2021-11-04 8.0 <
B 2021-11-05 9.0
In the actual problem, I don't want the weekend-days (Saturdays and Sundays), but I've stated the problem like above to prevent adding extra layers (I can easily filter the weekend-days out later if needed). However, not including them in the first place may allow for more efficient code, so I'd thought I'd mention this snag as well. Thanks for any help!
CodePudding user response:
Use DataFrame.groupby
with resample or Series.asfreq
for missing values for days and then interpolate per groups in lambda function:
df = (df.set_index('date')
.groupby('cat')['value']
.apply(lambda x: x.asfreq('d').interpolate())
.reset_index())
print (df)
cat date value
0 A 2021-01-01 1.000000
1 A 2021-01-02 1.333333
2 A 2021-01-03 1.666667
3 A 2021-01-04 2.000000
4 A 2021-01-05 3.000000
5 A 2021-01-06 3.500000
6 A 2021-01-07 4.000000
7 B 2021-11-01 5.000000
8 B 2021-11-02 6.000000
9 B 2021-11-03 7.000000
10 B 2021-11-04 8.000000
11 B 2021-11-05 9.000000
df = (df.set_index('date')
.groupby('cat')['value']
.apply(lambda x: x.resample('d').first().interpolate())
.reset_index())
print (df)
cat date value
0 A 2021-01-01 1.000000
1 A 2021-01-02 1.333333
2 A 2021-01-03 1.666667
3 A 2021-01-04 2.000000
4 A 2021-01-05 3.000000
5 A 2021-01-06 3.500000
6 A 2021-01-07 4.000000
7 B 2021-11-01 5.000000
8 B 2021-11-02 6.000000
9 B 2021-11-03 7.000000
10 B 2021-11-04 8.000000
11 B 2021-11-05 9.000000
Or:
f = lambda x: x.interpolate()
s = df.set_index('date').groupby('cat')['value'].resample('d').first().groupby(level=0).apply(f)
print (s)
cat date
A 2021-01-01 1.000000
2021-01-02 1.333333
2021-01-03 1.666667
2021-01-04 2.000000
2021-01-05 3.000000
2021-01-06 3.500000
2021-01-07 4.000000
B 2021-11-01 5.000000
2021-11-02 6.000000
2021-11-03 7.000000
2021-11-04 8.000000
2021-11-05 9.000000
Name: value, dtype: float64
CodePudding user response:
You could use a helper function:
def interpolate(d, on='date', vals=['value']):
return (d.set_index(on).reindex(pd.date_range(d[on].min(), d[on].max()))
[vals].interpolate()
.rename_axis(on)
)
df.groupby('cat').apply(interpolate).reset_index()
output:
cat date value
0 A 2021-01-01 1.000000
1 A 2021-01-02 1.333333
2 A 2021-01-03 1.666667
3 A 2021-01-04 2.000000
4 A 2021-01-05 3.000000
5 A 2021-01-06 3.500000
6 A 2021-01-07 4.000000
7 B 2021-11-01 5.000000
8 B 2021-11-02 6.000000
9 B 2021-11-03 7.000000
10 B 2021-11-04 8.000000
11 B 2021-11-05 9.000000
CodePudding user response:
An option is a combination of interpolate with complete:
# pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
dates = dict(date = lambda df: pd.date_range(df.min(), df.max(), freq='1D'))
(df.complete(dates, by='cat', sort = True)
.assign(value = lambda df: df.value.interpolate())
)
cat date value
0 A 2021-01-01 1.000000
1 A 2021-01-02 1.333333
2 A 2021-01-03 1.666667
3 A 2021-01-04 2.000000
4 A 2021-01-05 3.000000
5 A 2021-01-06 3.500000
6 A 2021-01-07 4.000000
7 B 2021-11-01 5.000000
8 B 2021-11-02 6.000000
9 B 2021-11-03 7.000000
10 B 2021-11-04 8.000000
11 B 2021-11-05 9.000000
complete
exposes the missing values,after which we then interpolate on a linear
method.