Home > other >  Comparing values in two pandas dataframes within a maximum deviation
Comparing values in two pandas dataframes within a maximum deviation

Time:12-27

I have 2 pandas dataframes:

df_func = pd.DataFrame({
    'col_ref': [1, 4],
    'value1': [12.0, 13.1], 'value2': [55.1, 22.1],
    'value3': [22.2, 10.0], 'value4': [111, 0],
})

df_val = pd.DataFrame(
    {'col_ref': [4, 1], 'value': [13.1, 1.0]}
)

And 4 variables max_deviation_value1 max_deviation_value2 max_deviation_value3 max_deviation_value4 containing the maximum deviation allowed for each column in df_func.

For example, say:

max_deviation_value1 = 1.5
max_deviation_value2 = 2
max_deviation_value3 = 4
max_deviation_value4 = 2.3

How can I compare each row in df_val's value column with its corresponding values for all columns in df_func when col_ref is the same and check if the deviation is within the corresponding acceptable maximum deviation value?

Obs.: while col_ref is the common link between rows of the two DataFrames, it is not necessarily in the same order.

CodePudding user response:

The first step is to make sure that col_ref is the index of each DataFrame, in order to make the rest of the operations easier. If that isn't already the case, then:

df_func = df_func.set_index('col_ref')
df_val = df_val.set_index('col_ref')

Then, we need to indicate to pandas that we mean to subtract df_val's 'value' row-wise (from each column value{i}):

diff = df_func.subtract(df_val.squeeze(), axis=0)

>>> diff
         value1  value2  value3  value4
col_ref                                
1          11.0    54.1    21.2   110.0
4           0.0     9.0    -3.1   -13.1

Now, on the max_deviation values: the simplest is to put them in a Series. For example:

max_deviation = pd.Series(dict(
    value1=1.5,
    value2=2.0,
    value3=4.0,
    value4=2.3,
))

Then, the comparison practically writes itself:

>>> diff.abs() <= max_deviation
         value1  value2  value3  value4
col_ref                                
1         False   False   False   False
4          True   False    True   False

Addendum

In the comments, the OP mentioned that there could be duplicate col_ref values in df_func. In that case, the following logic can be applied:

# for example: duplicate col_ref == 1 in df_func
df_func = pd.DataFrame({
    'col_ref': [1, 4, 1],
    'value1': [12.0, 13.1, 5.0],
    'value2': [55.1, 22.1, 5.1],
    'value3': [22.2, 10.0, 2.2],
    'value4': [111, 0, 11],
}).set_index('col_ref')

Then, the following will correctly remove df_val value' for each corresponding col_ref` rows, no matter how many of them there are:

diff = df_func - df_val.reindex(df_func.index).values

>>> diff
         value1  value2  value3  value4
col_ref                                
1          11.0    54.1    21.2   110.0
4           0.0     9.0    -3.1   -13.1
1           4.0     4.1     1.2    10.0

>>> diff.abs() <= max_deviation
         value1  value2  value3  value4
col_ref                                
1         False   False   False   False
4          True   False    True   False
1         False   False    True   False

Last note

If df_val also contains duplicate entries for col_ref, then we can deduplicate (e.g. taking the first one):

# if done after `col_ref` has already been made the index
df_val = df_val.reset_index().drop_duplicates(subset='col_ref', keep='first').set_index('col_ref')
  • Related