pandas dataframe apply function to columns with nans

I have a large pandas dataframe with many time series of different lengths which looks something like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=30, freq='Q'),
                   'series1': range(30),
                   'series2': range(30),
                   'series3': range(30)}).set_index('date')

df.iloc[:11, df.columns.get_loc('series1')] = np.nan
df.iloc[:6, df.columns.get_loc('series2')] = np.nan

I want to apply a statistical filter from statsmodels to each time series, but the module can't handle missing values and returns a dataframe full of NaNs.

import statsmodels.api as sm
df_cycles, df_trend = sm.tsa.filters.cffilter(df, low=6, high=40)

Is there a clean way to apply the function to each column individually, starting from its first non-NaN observation? I don't want to exclude rows with missing values.

CodePudding user response:

This thread might be useful to you. statsmodels functions typically have a 'missing' keyword, but as detailed in the linked StackOverflow answer, this isn't always the case. The author of that post proposed this solution:

import statsmodels.formula.api as smf

available_data = all_data.loc[:, ['y', 'x', 'groupid']].dropna(how='any')
model = smf.ols('y ~ x', data=available_data)

Applied to your code, this produces no error:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=30, freq='Q'),
                   'series1': range(30),
                   'series2': range(30),
                   'series3': range(30)}).set_index('date')

df.iloc[:11, df.columns.get_loc('series1')] = np.nan
df.iloc[:6, df.columns.get_loc('series2')] = np.nan

# Change here
available_data = df.dropna(how="any")

import statsmodels.api as sm
df_cycles, df_trend = sm.tsa.filters.cffilter(available_data, low=6, high=40)
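One caveat worth noting (my addition, not part of the original answer): `dropna(how='any')` removes a row if *any* column is NaN, so the longer series get truncated to where the shortest one begins. A minimal sketch of this effect, using a plain integer index and float columns for simplicity:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the question: series start at different points.
df = pd.DataFrame({'series1': np.arange(30.0),
                   'series2': np.arange(30.0),
                   'series3': np.arange(30.0)})
df.iloc[:11, 0] = np.nan   # series1 starts 11 rows in
df.iloc[:6, 1] = np.nan    # series2 starts 6 rows in

# dropna(how='any') keeps only rows where all three series are observed,
# so series2 and series3 lose valid rows along with series1's NaNs.
available_data = df.dropna(how='any')
print(len(df), len(available_data))  # 30 19
```

So this approach is only lossless if all series happen to start at the same date.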

CodePudding user response:

You can apply the filter to each column separately and store the results in some object. Here I build a dictionary with the column name as the key and a DataFrame containing the cycles and trend as the value.

import pandas as pd
import statsmodels.api as sm

filtered = {}
for c in df.columns:
    cycles, trend = sm.tsa.filters.cffilter(
        df[c].astype(float).dropna(),  # drop only this column's NaNs
        low=6,
        high=40
    )
    filtered[c] = pd.DataFrame({'cycles': cycles, 'trend': trend})


print(filtered['series1'])

#                  cycles  trend
#date                           
#2015-12-31  8.881784e-16   11.0
#2016-03-31  6.661338e-16   11.0
#...


print(filtered['series3'])

#            cycles  trend
#date                     
#2013-03-31     0.0    0.0
#2013-06-30     0.0    0.0
#...

As you can see, for each column, the rows containing NaNs were dropped without affecting the other columns.
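If you then want the results back in two wide frames aligned on the original index (one for cycles, one for trends), the per-column results can be concatenated and reindexed; NaNs reappear where a series had no data. A sketch using stand-in filter output (a dict of DataFrames shaped like `filtered` above), since the combining step is plain pandas:

```python
import pandas as pd

# Stand-in for the per-column filter results: each DataFrame covers only
# that series' valid rows, mirroring the dict built in the loop above.
idx = pd.RangeIndex(30)
filtered = {
    'series1': pd.DataFrame({'cycles': 0.0, 'trend': 1.0}, index=idx[11:]),
    'series2': pd.DataFrame({'cycles': 0.0, 'trend': 1.0}, index=idx[6:]),
    'series3': pd.DataFrame({'cycles': 0.0, 'trend': 1.0}, index=idx),
}

# One wide frame per component, reindexed to the full index so that
# NaNs reappear wherever a series had no observations.
df_cycles = pd.concat({c: f['cycles'] for c, f in filtered.items()}, axis=1).reindex(idx)
df_trend = pd.concat({c: f['trend'] for c, f in filtered.items()}, axis=1).reindex(idx)

print(df_cycles.shape)                    # (30, 3)
print(df_cycles['series1'].isna().sum())  # 11
```

This recovers the original wide layout of `df`, with the leading NaNs marking where each series had not yet started.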
