Pandas - checking if value is inside boundary defined in column pairs-CodePudding

I have a bunch of bounding values in a dataframe,

forecast = pd.read_table(
    StringIO(
        """
                                field1_lower     field1_upper   field2_lower     field2_upper
        "2022-09-25 10:00:00"    71.925000        75.275000        73.525000        75.275000
        "2022-09-25 11:00:00"    71.625000        75.359000        73.225000        75.352100
        "2022-09-25 12:00:00"    71.333333        75.339000        71.399333        75.339000
        "2022-09-25 13:00:00"    64.571429        75.139000        90.511429        92.139000
        "2022-09-25 14:00:00"    72.285714        75.339000        50.585714        75.339000
        """
    ),
    delim_whitespace=True,
)

and a bunch of observed measurements,

actual = pd.read_table(
    StringIO(
        """
                                    field1           field2
        "2022-09-25 10:00:00"    72.925000        74.275000
        "2022-09-25 11:00:00"    73.525000        76.359000
        "2022-09-25 12:00:00"    72.664333        72.339000
        "2022-09-25 13:00:00"    70.231429        91.235000
        "2022-09-25 14:00:00"    56.280000        66.239000
        """
    ),
    delim_whitespace=True,
)

I'd like to check if the observed measurements are within the bounds as defined in the forecast-dataframe.

I successfully converted the forecast dataframe into one with tuples,

def convert_column_pairs_to_tuples(df: pd.DataFrame) -> pd.DataFrame:
    column_names = []
    df_new = pd.DataFrame([], index=df.index)

    for fieldname_lower, fieldname_upper in pairwise(list(df.columns)):
        column_name = fieldname_lower.replace('_lower', '')
        column_names.append(column_name)

        assert column_name == fieldname_upper.replace('_upper', '')

        df_new[column_name] = list(zip(df[fieldname_lower], df[fieldname_upper]))

    return df_new

and I suppose I could use .apply to solve the rest, but something doesn't smell right about having to have an intermediary DataFrame with tuples.

Is there an alternative solution wherein I don't have to create the intermediary dataframe?

CodePudding user response：

You can compare columns directly

actual['field1_in_bound'] = (actual['field1']>forecast['field1_lower'])&(actual['field1']<forecast['field1_upper'])

2022-09-25 10:00:00     True
2022-09-25 11:00:00     True
2022-09-25 12:00:00     True
2022-09-25 13:00:00     True
2022-09-25 14:00:00    False
Name: field1_in_bound, dtype: bool

if you have many columns and you wish to do them all at once:

# I assume all your columns are sorted correctly
forecast_lower = forecast.iloc[:, ::2]
forecast_upper = forecast.iloc[:, 1::2]
actual[[c '_in_bound' for c in actual.columns]] = (actual>forecast_lower.values)&(actual<forecast_upper.values)

CodePudding user response：

Another possible solution:

df3 = actual.join(forecast)
fields = actual.columns

pd.concat(map(lambda x: ((df3[x].ge(df3[x   '_lower'])) 
    & (df3[x].le(df3[x   '_upper']))).rename(x   '_res'), fields), axis=1)

Output:

                     field1_res  field2_res
2022-09-25 10:00:00        True        True
2022-09-25 11:00:00        True       False
2022-09-25 12:00:00        True        True
2022-09-25 13:00:00        True        True
2022-09-25 14:00:00       False        True