I'm fairly new to Python and pandas.
Several sensors write data to a database at varying intervals, usually many times per minute.
I read this data into a pandas dataframe, resample it to 1 minute and aggregate the values. There are minutes without a data point in col_a and col_b, so the resampled dataframe contains NaNs in these columns. The resampled dataframe looks like this:
import pandas as pd
import numpy as np
d = {'tstamptz': ['2022-08-09 10:30:00 00:02',
                  '2022-08-09 10:30:01 00:02',
                  '2022-08-09 10:30:02 00:02',
                  '2022-08-09 10:30:03 00:02',
                  '2022-08-09 10:30:04 00:02',
                  '2022-08-09 10:30:05 00:02',
                  '2022-08-09 10:30:06 00:02',
                  '2022-08-09 10:30:07 00:02',
                  '2022-08-09 10:30:08 00:02',
                  '2022-08-09 10:30:09 00:02',
                  '2022-08-09 10:30:10 00:02'],
     'col_a': [413.8, 409.6, np.nan, np.nan, 409.6, 405.1, 417.3, np.nan, np.nan, np.nan, 426.4],
     'col_b': [409.6, 407.9, np.nan, np.nan, 405.1, 417.3, 431.4, np.nan, np.nan, np.nan, 419.9],
     'col_c': [5.1, 5.1, 5.0, 5.0, 5.0, 5.0, 4.9, 4.9, 5.0, 5.0, 5.1]}
df = pd.DataFrame(d)
df.set_index('tstamptz', inplace=True)
I fill the missing data as follows:
- The last available value of col_b goes into col_a of the first NaN row.
- The first available value after the NaN rows of col_a goes into col_b of the first NaN row.
- The values are propagated for all further, consecutive NaN rows: col_b -> col_a
This must be done for all consecutive NaN rows, so that this process is restarted as soon as a non-NaN row follows.
This is my current solution to do this:
result = df.copy()
last_index_position = result.index.get_loc(result.index[-1])
# While there are any NaN rows
while result[['col_a','col_b']].isnull().values.any():
    # Get index position of first NaN row
    first_nan = list(np.where(result[['col_a','col_b']].isna()))[0][0]
    try:
        # Check, up to which index position it contains consecutive NaN rows
        next_nonan_index = result.index.get_loc(
            result[['col_a','col_b']].iloc[first_nan:].first_valid_index()
        )
    # Catch the KeyError, if there is no valid index left in dataframe (last index is NaN)
    except KeyError:
        next_nonan_index = last_index_position + 1 # add 1 to include last index in ffill method
    # Make sure, that there's a non-NaN row in front of the NaN row(s)
    # If the row on position 0 is NaN, take 0 instead of -1
    if first_nan - 1 >= 0:
        fillblock_start = first_nan - 1
    else:
        fillblock_start = 0
    # Fill the first NaN row
    result['col_a'].iloc[fillblock_start:next_nonan_index].fillna(
        df['col_b'].ffill(), limit = 1, inplace = True)
    result['col_b'].iloc[fillblock_start:next_nonan_index].fillna(
        df['col_a'].bfill(), limit = 1, inplace = True)
    # Fill the rest of the rows
    result['col_a'].iloc[fillblock_start:next_nonan_index].fillna(
        result['col_b'].ffill(), inplace = True)
    result['col_b'].iloc[fillblock_start:next_nonan_index].fillna(
        result['col_b'].ffill(), inplace = True)
This works as expected, but is terribly awkward.
Question: How can I let pandas do this in a cleaner and probably faster way?
CodePudding user response:
You can do the following in this sequence:
- fill NaNs in col_a with shifted col_b (this only replaces the first NaN in each group)
- backfill the remaining NaNs in col_a
- fill NaNs in col_b with backshifted col_a (this only replaces the first NaN in each group)
- forward fill the remaining NaNs in col_b
result = df.copy()
result.col_a = result.col_a.fillna(result.col_b.shift()).bfill()
result.col_b = result.col_b.fillna(result.col_a.shift(-1)).ffill()
Result:
                           col_a  col_b  col_c
tstamptz
2022-08-09 10:30:00 00:02  413.8  409.6    5.1
2022-08-09 10:30:01 00:02  409.6  407.9    5.1
2022-08-09 10:30:02 00:02  407.9  409.6    5.0
2022-08-09 10:30:03 00:02  409.6  409.6    5.0
2022-08-09 10:30:04 00:02  409.6  405.1    5.0
2022-08-09 10:30:05 00:02  405.1  417.3    5.0
2022-08-09 10:30:06 00:02  417.3  431.4    4.9
2022-08-09 10:30:07 00:02  431.4  426.4    4.9
2022-08-09 10:30:08 00:02  426.4  426.4    5.0
2022-08-09 10:30:09 00:02  426.4  426.4    5.0
2022-08-09 10:30:10 00:02  426.4  419.9    5.1
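If it helps to see which rows each step touches, the two chained calls can be split into four intermediate Series. This is only a rough sketch for checking the logic: the names a_first, a_filled, b_first, b_filled and steps are made up for illustration, and a default integer index is assumed instead of tstamptz to keep it short.

```python
import numpy as np
import pandas as pd

# Same col_a / col_b values as in the question.
# Assumption: a default integer index stands in for tstamptz.
df = pd.DataFrame({
    'col_a': [413.8, 409.6, np.nan, np.nan, 409.6, 405.1, 417.3, np.nan, np.nan, np.nan, 426.4],
    'col_b': [409.6, 407.9, np.nan, np.nan, 405.1, 417.3, 431.4, np.nan, np.nan, np.nan, 419.9],
})

# Step 1: col_b shifted down one row puts the last valid col_b value
# next to the first NaN row of each gap in col_a.
a_first = df['col_a'].fillna(df['col_b'].shift())

# Step 2: the NaNs still left in col_a take the next valid value below them.
a_filled = a_first.bfill()

# Step 3: the filled col_a, shifted up one row, supplies values for col_b.
b_first = df['col_b'].fillna(a_filled.shift(-1))

# Step 4: forward fill anything that is still missing in col_b.
b_filled = b_first.ffill()

# Side-by-side view of the intermediate (hypothetically named) steps.
steps = pd.DataFrame({
    'a_first': a_first,
    'a_filled': a_filled,
    'b_first': b_first,
    'b_filled': b_filled,
})
print(steps)
```

With this particular data, step 3 already fills every gap in col_b because col_a is complete by then, so the final ffill changes nothing here.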