I have a recurrent problem that I solve inelegantly every-time, and I am unable to find a good way to do it. Let's say I have a dataframe with datetime in index, spanning every 3 hours (df1). I have another dataframe spanning every day (df2).
I want to do 2 things:
- Resample the df1 to span everyday instead of every 3h, by calculating a mean of each 3h periods over a day.
- Interpolate df2 for any day that would be missing, and add that day where it belongs.
Issues: I use for loops (and want to avoid that) and the resampling of missing days is incomplete (can only attribute 1 value).
This is how I was doing it:
import numpy as np
import pandas as pd
from datetime import *
# Create df1
rng = pd.date_range('2000-01-01', periods=365*(24/3), freq='3H')
df1 = pd.DataFrame({'Val': np.random.randn(len(rng)) }, index = rng)
# Create df2 and drop a few rows
rng2 = pd.date_range('2000-01-01', periods=365, freq='D')
df2 = pd.DataFrame({'Val': np.random.randn(len(rng2)) },index = rng2)
df2 = df2.drop([datetime(2000,1,5),datetime(2000,1,24)])
# Create reference timelist
date_list = [datetime(2000,1,1) timedelta(days=x) for x in range(365)]
# Calculate the daily mean of df1:
# We create an array hosting the resampled values of df1
arr = []
c = 1
# Loop that appends the array everytime we hit a new day, and calculate a mean of the day that passed
for i in range(1,len(df1)):
if c < 365 and df1.index[i] == date_list[c]:
arr.append(np.mean(df1[i-8:i])[0])
c = c 1
# Calculate the last value of the array
arr.append(np.mean(df1[i-7:i 1])[0])
# Create a new dataframe hosting the daily values from df1
df3 = pd.DataFrame({'Val': arr}, index = rng2)
# Replace missing days in df2
df2 = df2.reindex(date_list, fill_value=0)
df2 = df2.resample('D').interpolate(method='linear') # but this does not work
CodePudding user response:
I think there are two simple fixes for both these issues; you just need to update your use of resample
for both.
First point: just resample
Your first point is precisely a case of downsampling with resample
. You can replace your whole creation of df3
with:
df1.resample('D').mean()
This is going to average all the 3 hour periods for each day. For confirmation, we can check that your results are the same as what I am proposing:
>>> all(df1.resample('D').mean().round(8) == df3.round(8))
True
Note that I have to round because there are floating point errors between your code and resample
; but they are extremely close.
Second point: don't reindex first
When you interpolate in the second case to fill the missing days, you want to still have the missing days to fill! AKA, if you reindex
first and fill the value with 0
, the interpolation "fails" because it doesn't find anything to interpolate. So if I get your issue correctly, you just want to remove the reindex
line:
# df2 = df2.reindex(date_list, fill_value=0)
df2 = df2.resample('D').interpolate(method='linear')
So if you start with df2
like this:
>>> df.head(10)
Val
2000-01-01 0.235151
2000-01-02 1.279017
2000-01-03 -1.267074
2000-01-04 -0.270182 # the fifth is missing
2000-01-06 0.382649
2000-01-07 0.120253
2000-01-08 -0.223690
2000-01-09 1.379003
2000-01-10 -0.477681
2000-01-11 0.619466
You end with this:
>>> df2.head(10)
Val
2000-01-01 0.235151
2000-01-02 1.279017
2000-01-03 -1.267074
2000-01-04 -0.270182
2000-01-05 0.056233 # the fifth is here, halfway between 4th and 6th
2000-01-06 0.382649
2000-01-07 0.120253
2000-01-08 -0.223690
2000-01-09 1.379003
2000-01-10 -0.477681