I have 2 pandas dataframes:
df_func = pd.DataFrame({
'col_ref': [1, 4],
'value1': [12.0, 13.1], 'value2': [55.1, 22.1],
'value3': [22.2, 10.0], 'value4': [111, 0],
})
df_val = pd.DataFrame(
{'col_ref': [4, 1], 'value': [13.1, 1.0]}
)
And 4 variables max_deviation_value1
max_deviation_value2
max_deviation_value3
max_deviation_value4
containing the maximum deviation allowed for each column in df_func
.
For example, say:
max_deviation_value1 = 1.5
max_deviation_value2 = 2
max_deviation_value3 = 4
max_deviation_value4 = 2.3
How can I compare each row in df_val
's value
column with its corresponding values for all columns in df_func
when col_ref
is the same and check if the deviation is within the corresponding acceptable maximum deviation value?
Obs.: while col_ref
is the common link between rows of the two DataFrames, it is not necessarily in the same order.
CodePudding user response:
The first step is to make sure that col_ref
is the index of each DataFrame, in order to make the rest of the operations easier. If that isn't already the case, then:
df_func = df_func.set_index('col_ref')
df_val = df_val.set_index('col_ref')
Then, we need to indicate to pandas that we mean to subtract df_val's
'value'
row-wise (from each column value{i}
):
diff = df_func.subtract(df_val.squeeze(), axis=0)
>>> diff
value1 value2 value3 value4
col_ref
1 11.0 54.1 21.2 110.0
4 0.0 9.0 -3.1 -13.1
Now, on the max_deviation
values: the simplest is to put them in a Series. For example:
max_deviation = pd.Series(dict(
value1=1.5,
value2=2.0,
value3=4.0,
value4=2.3,
))
Then, the comparison practically writes itself:
>>> diff.abs() <= max_deviation
value1 value2 value3 value4
col_ref
1 False False False False
4 True False True False
Addendum
In the comments, the OP mentioned that there could be duplicate col_ref
values in df_func
. In that case, the following logic can be applied:
# for example: duplicate col_ref == 1 in df_func
df_func = pd.DataFrame({
'col_ref': [1, 4, 1],
'value1': [12.0, 13.1, 5.0],
'value2': [55.1, 22.1, 5.1],
'value3': [22.2, 10.0, 2.2],
'value4': [111, 0, 11],
}).set_index('col_ref')
Then, the following will correctly remove df_val
value' for each corresponding
col_ref` rows, no matter how many of them there are:
diff = df_func - df_val.reindex(df_func.index).values
>>> diff
value1 value2 value3 value4
col_ref
1 11.0 54.1 21.2 110.0
4 0.0 9.0 -3.1 -13.1
1 4.0 4.1 1.2 10.0
>>> diff.abs() <= max_deviation
value1 value2 value3 value4
col_ref
1 False False False False
4 True False True False
1 False False True False
Last note
If df_val
also contains duplicate entries for col_ref
, then we can deduplicate (e.g. taking the first one):
# if done after `col_ref` has already been made the index
df_val = df_val.reset_index().drop_duplicates(subset='col_ref', keep='first').set_index('col_ref')