I have a large pandas dataframe with many time series of different lengths, which looks something like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=30, freq='Q'),
                   'series1': range(30),
                   'series2': range(30),
                   'series3': range(30)}).set_index('date')
df.loc[df.index[:11], 'series1'] = np.nan
df.loc[df.index[:6], 'series2'] = np.nan
I want to apply a statistical filter from statsmodels to each time series, but the module can't handle missing values and returns a dataframe full of NaNs.
import statsmodels.api as sm
df_cycles, df_trend = sm.tsa.filters.cffilter(df, low=6, high=40)
Is there a clean way to apply the function to each column individually, starting from its first non-NaN observation? I don't want to exclude the rows with missing values.
CodePudding user response:
This thread might be useful to you. Statsmodels functions typically have a 'missing' keyword, but as detailed in the linked StackOverflow answer, this isn't always the case. The author of that post proposed this solution:
import statsmodels.formula.api as smf

# 'all_data' here is the dataframe from the linked answer, not your data
available_data = all_data.loc[:, ['y', 'x', 'groupid']].dropna(how='any')
model = smf.ols('y ~ x', data=available_data)
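As an aside, for the statsmodels models that do expose that keyword, it is passed at model construction. A minimal sketch with made-up arrays (not your data):

import numpy as np
import statsmodels.api as sm

y = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
X = sm.add_constant(np.arange(5.0))
model = sm.OLS(y, X, missing='drop')  # rows with NaN are dropped before fitting
print(model.fit().params)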
The dropna approach above produces no error when used with your code:
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=30, freq='Q'),
                   'series1': range(30),
                   'series2': range(30),
                   'series3': range(30)}).set_index('date')
df.loc[df.index[:11], 'series1'] = np.nan
df.loc[df.index[:6], 'series2'] = np.nan

# Change here: keep only the rows where every series is observed
available_data = df.dropna(how='any')

df_cycles, df_trend = sm.tsa.filters.cffilter(available_data, low=6, high=40)
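A quick optional sanity check (a sketch based on the setup above): dropna(how='any') keeps only the dates where every series is observed, so everything is trimmed to the start of the shortest series:

# The surviving rows start where series1 (the shortest series) starts
print(available_data.index[0])   # 2015-12-31
print(available_data.shape)      # (19, 3): the 11 leading NaN rows are gone for all columns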
CodePudding user response:
You can apply the filter to each column individually and store the results in some container. Here I build a dictionary with the column name as the key and, as the value, a dataframe containing that column's cycles and trend.
filtered = {}
for c in df.columns:
    # Drop this column's NaNs independently of the other columns, then filter
    cycles, trend = sm.tsa.filters.cffilter(
        df[c].astype(float).dropna(),
        low=6,
        high=40
    )
    filtered[c] = pd.DataFrame({'cycles': cycles, 'trend': trend})
print(filtered['series1'])
# cycles trend
#date
#2015-12-31 8.881784e-16 11.0
#2016-03-31 6.661338e-16 11.0
#...
print(filtered['series3'])
# cycles trend
#date
#2013-03-31 0.0 0.0
#2013-06-30 0.0 0.0
#...
As you can see, for each column, the rows containing NaNs were dropped without affecting the other columns.
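If you then want everything back in a single dataframe aligned on the original dates (so the leading NaNs reappear where a series starts later), one way to do it, sketched here, is to concatenate the dictionary and reindex:

# Columns become a (series, component) MultiIndex; reindex aligns on the original date index
combined = pd.concat(filtered, axis=1).reindex(df.index)
print(combined.head())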