Comparing values in two pandas dataframes within a maximum deviation-CodePudding

I have 2 pandas dataframes:

df_func = pd.DataFrame({
    'col_ref': [1, 4],
    'value1': [12.0, 13.1], 'value2': [55.1, 22.1],
    'value3': [22.2, 10.0], 'value4': [111, 0],
})

df_val = pd.DataFrame(
    {'col_ref': [4, 1], 'value': [13.1, 1.0]}
)

And 4 variables max_deviation_value1 max_deviation_value2 max_deviation_value3 max_deviation_value4 containing the maximum deviation allowed for each column in df_func.

For example, say:

max_deviation_value1 = 1.5
max_deviation_value2 = 2
max_deviation_value3 = 4
max_deviation_value4 = 2.3

How can I compare each row in df_val's value column with its corresponding values for all columns in df_func when col_ref is the same and check if the deviation is within the corresponding acceptable maximum deviation value?

Obs.: while col_ref is the common link between rows of the two DataFrames, it is not necessarily in the same order.

CodePudding user response：

The first step is to make sure that col_ref is the index of each DataFrame, in order to make the rest of the operations easier. If that isn't already the case, then:

df_func = df_func.set_index('col_ref')
df_val = df_val.set_index('col_ref')

Then, we need to indicate to pandas that we mean to subtract df_val's 'value' row-wise (from each column value{i}):

diff = df_func.subtract(df_val.squeeze(), axis=0)

>>> diff
         value1  value2  value3  value4
col_ref                                
1          11.0    54.1    21.2   110.0
4           0.0     9.0    -3.1   -13.1

Now, on the max_deviation values: the simplest is to put them in a Series. For example:

max_deviation = pd.Series(dict(
    value1=1.5,
    value2=2.0,
    value3=4.0,
    value4=2.3,
))

Then, the comparison practically writes itself:

>>> diff.abs() <= max_deviation
         value1  value2  value3  value4
col_ref                                
1         False   False   False   False
4          True   False    True   False

Addendum

In the comments, the OP mentioned that there could be duplicate col_ref values in df_func. In that case, the following logic can be applied:

# for example: duplicate col_ref == 1 in df_func
df_func = pd.DataFrame({
    'col_ref': [1, 4, 1],
    'value1': [12.0, 13.1, 5.0],
    'value2': [55.1, 22.1, 5.1],
    'value3': [22.2, 10.0, 2.2],
    'value4': [111, 0, 11],
}).set_index('col_ref')

Then, the following will correctly remove df_val value' for each corresponding col_ref` rows, no matter how many of them there are:

diff = df_func - df_val.reindex(df_func.index).values

>>> diff
         value1  value2  value3  value4
col_ref                                
1          11.0    54.1    21.2   110.0
4           0.0     9.0    -3.1   -13.1
1           4.0     4.1     1.2    10.0

>>> diff.abs() <= max_deviation
         value1  value2  value3  value4
col_ref                                
1         False   False   False   False
4          True   False    True   False
1         False   False    True   False

Last note

If df_val also contains duplicate entries for col_ref, then we can deduplicate (e.g. taking the first one):

# if done after `col_ref` has already been made the index
df_val = df_val.reset_index().drop_duplicates(subset='col_ref', keep='first').set_index('col_ref')