fill missing data by rule for each consecutive NaN-block

Time:08-10

I'm fairly new to Python and pandas.

Several sensors write data to a database at varying intervals, usually many times per minute.

I read this data into a pandas dataframe, resample it to 1 minute and aggregate the values. There are minutes without a data point in col_a and col_b, so the resampled dataframe contains NaNs in these columns. The resampled dataframe looks like this:

import pandas as pd
import numpy as np

d = {'tstamptz': [  '2022-08-09 10:30:00 00:02',
                    '2022-08-09 10:30:01 00:02',
                    '2022-08-09 10:30:02 00:02',
                    '2022-08-09 10:30:03 00:02',
                    '2022-08-09 10:30:04 00:02',
                    '2022-08-09 10:30:05 00:02',
                    '2022-08-09 10:30:06 00:02',
                    '2022-08-09 10:30:07 00:02',
                    '2022-08-09 10:30:08 00:02',
                    '2022-08-09 10:30:09 00:02',
                    '2022-08-09 10:30:10 00:02'],
     'col_a': [413.8, 409.6, np.nan, np.nan, 409.6, 405.1, 417.3, np.nan, np.nan, np.nan, 426.4],
     'col_b': [409.6, 407.9, np.nan, np.nan, 405.1, 417.3, 431.4, np.nan, np.nan, np.nan, 419.9],
     'col_c': [5.1, 5.1, 5.0, 5.0, 5.0, 5.0, 4.9, 4.9, 5.0, 5.0, 5.1]}

df = pd.DataFrame(d)
df.set_index('tstamptz', inplace=True)
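
For anyone who wants to reproduce the gaps, here is a minimal sketch of how resampling can produce NaN rows. The raw readings and timestamps below are hypothetical and only for illustration, not the real sensor data; the mean is also just an assumed aggregation.

```python
import pandas as pd

# Hypothetical raw readings at irregular times (assumption, not the real data)
raw = pd.DataFrame(
    {"col_a": [413.8, 409.6, 405.1]},
    index=pd.to_datetime(
        ["2022-08-09 10:30:10", "2022-08-09 10:30:40", "2022-08-09 10:32:05"]
    ),
)

# One row per minute; a minute without any reading aggregates to NaN
per_minute = raw.resample("1min").mean()
print(per_minute)
```

In this sketch the minute 10:31 has no reading, so its row is NaN after resampling.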

I fill the missing data as follows:

  • The last available value of col_b goes into col_a of the first NaN row.
  • The first available value of col_a after the NaN rows goes into col_b of the first NaN row.
  • The values are propagated for all further, consecutive NaN rows: col_b -> col_a

This must be done for every block of consecutive NaN rows: the process restarts as soon as a non-NaN row follows.
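
To make "block of consecutive NaN rows" concrete, here is a small sketch that numbers the blocks in col_a. The names `is_nan`, `block_start` and `block_id` are made up for this illustration, and the approach is only one possible way to label the blocks.

```python
import numpy as np
import pandas as pd

col_a = pd.Series([413.8, 409.6, np.nan, np.nan, 409.6, 405.1,
                   417.3, np.nan, np.nan, np.nan, 426.4])

is_nan = col_a.isna()
# A block starts where a NaN follows a non-NaN value (or opens the series)
block_start = is_nan & ~is_nan.shift(fill_value=False)
# Number the blocks 1, 2, ...; rows that are not NaN get 0
block_id = block_start.cumsum().where(is_nan, 0)
print(block_id.tolist())
# → [0, 0, 1, 1, 0, 0, 0, 2, 2, 2, 0]
```

The fill rule above has to be applied once per non-zero block id.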

This is my current solution to do this:

result = df.copy()
last_index_position = result.index.get_loc(result.index[-1])

# While there are any NaN rows
while result[['col_a','col_b']].isnull().values.any():

    # Get index position of first NaN row
   first_nan = list(np.where(result[['col_a','col_b']].isna()))[0][0]
   
   try:
      # Check, up to which index position it contains consecutive NaN rows
      next_nonan_index = result.index.get_loc(
                              result[['col_a','col_b']].iloc[first_nan:].first_valid_index()
                         )
   # Catch the KeyError, if there is no valid index left in dataframe (last index is NaN)
   except KeyError:
      next_nonan_index = last_index_position + 1 # add 1 to include last index in ffill method

   # Make sure, that there's a non-NaN row in front of the NaN row(s)
   # If the row on position 0 is NaN, take 0 instead of -1
   if first_nan - 1 >= 0:
      fillblock_start = first_nan -1
   else:
      fillblock_start = 0
   
   # Fill the first NaN row
   result['col_a'].iloc[fillblock_start:next_nonan_index].fillna(
      df['col_b'].ffill(), limit = 1, inplace = True)
   result['col_b'].iloc[fillblock_start:next_nonan_index].fillna(
      df['col_a'].bfill(), limit = 1, inplace = True)
   

   # Fill the rest of the rows
   result['col_a'].iloc[fillblock_start:next_nonan_index].fillna(
      result['col_b'].ffill(), inplace = True)   
   result['col_b'].iloc[fillblock_start:next_nonan_index].fillna(
      result['col_b'].ffill(), inplace = True)

This works as expected, but is terribly awkward.

Question: How can I let pandas do this in a cleaner and probably faster way?

Answer:

You can do the following in this sequence:

  1. fill NaNs in col_a with col_b shifted down by one row (this only replaces the first NaN in each block)
  2. backfill the remaining NaNs in col_a
  3. fill NaNs in col_b with col_a shifted up by one row (col_a is already filled by then, so each NaN takes the col_a value of the next row)
  4. forward fill any NaNs still left in col_b

result = df.copy()
result.col_a = result.col_a.fillna(result.col_b.shift()).bfill()
result.col_b = result.col_b.fillna(result.col_a.shift(-1)).ffill()

Result:

                           col_a  col_b  col_c
tstamptz                                      
2022-08-09 10:30:00 00:02  413.8  409.6    5.1
2022-08-09 10:30:01 00:02  409.6  407.9    5.1
2022-08-09 10:30:02 00:02  407.9  409.6    5.0
2022-08-09 10:30:03 00:02  409.6  409.6    5.0
2022-08-09 10:30:04 00:02  409.6  405.1    5.0
2022-08-09 10:30:05 00:02  405.1  417.3    5.0
2022-08-09 10:30:06 00:02  417.3  431.4    4.9
2022-08-09 10:30:07 00:02  431.4  426.4    4.9
2022-08-09 10:30:08 00:02  426.4  426.4    5.0
2022-08-09 10:30:09 00:02  426.4  426.4    5.0
2022-08-09 10:30:10 00:02  426.4  419.9    5.1
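
If you want to check this yourself, here is a self-contained sketch that wraps the two-step fill in a function and runs it on the sample values. The function name `fill_nan_blocks` and its parameters are hypothetical, and a plain integer index is used instead of the timestamps for brevity.

```python
import numpy as np
import pandas as pd

def fill_nan_blocks(df, a="col_a", b="col_b"):
    """Hypothetical helper: apply the shift-based fill to columns a and b."""
    out = df.copy()
    # First NaN of each block in a gets the previous row's b; the rest are back-filled
    out[a] = out[a].fillna(out[b].shift()).bfill()
    # NaNs in b take the next row's (already filled) a; ffill covers what is left
    out[b] = out[b].fillna(out[a].shift(-1)).ffill()
    return out

df = pd.DataFrame({
    "col_a": [413.8, 409.6, np.nan, np.nan, 409.6, 405.1,
              417.3, np.nan, np.nan, np.nan, 426.4],
    "col_b": [409.6, 407.9, np.nan, np.nan, 405.1, 417.3,
              431.4, np.nan, np.nan, np.nan, 419.9],
})

filled = fill_nan_blocks(df)
print(filled)
```

One caveat: a NaN block at the very end of the frame has no value after it, so the backfill has nothing to use and col_a can keep NaNs in that block beyond its first row. If your data can end with a gap, you may need an extra rule for that case.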