Comparing pandas dataframes based on row values and a constant

Time:12-26

I have 2 pandas dataframes:

df_test_values

column1 | col_values
0       | 0.99        
3       | 0.21       
1       | 0.55       

df_reference_values

column1 | column2  | column3  | ... | column100
0       | 0.80     |  0.21    | ...
3       | 0.22     |  0.11    | ...
1       | 0.11     |  0.33    | ...

Considering I have a constant value permitted_deviation, and that column1 is the same in both dataframes (same values and same number of rows):

How can I compare all rows of df_test_values's col_values against a selected column in df_reference_values and determine how many of its values are within my variable permitted_deviation?

Input would be column_name for df_reference_values and permitted_deviation

An output could be something like this: "column_name has 100 values within the range permitted_deviation"

Any inputs/ideas are appreciated.

CodePudding user response:

This first suggestion is based on inferring that you want to evaluate each column individually against your permitted deviation, and not all the columns in aggregate.

(df_test_values["col_values"] - df_reference_values["column2"]).abs().apply(lambda x: 1 if x < permitted_deviation else 0).sum()

It uses the apply function, which can have performance issues on large dataframes, but it provides a one-line way to count the records that meet your condition. I took the absolute value of the deviation, figuring you care about the magnitude of the difference rather than its direction, so adjust as necessary.

This also assumes that your 2 dataframes have the same number of rows (which you indicated), and that the rows are in the exact same order. It looks like "column1" might be an index that identifies the corresponding rows; if so, you would need to make sure both frames are sorted in the same order.
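Since apply is flagged above as a potential bottleneck, here is a sketch of an apply-free equivalent: comparing the absolute differences directly yields a boolean Series, and summing it counts the True entries. It uses the question's sample numbers and an assumed permitted_deviation of 0.15 for illustration.

```python
import pandas as pd

permitted_deviation = 0.15  # assumed threshold for illustration

# Sample values from the question, in matching row order
df_test_values = pd.DataFrame({"col_values": [0.99, 0.21, 0.55]})
df_reference_values = pd.DataFrame({"column2": [0.80, 0.22, 0.11]})

# Boolean Series of "within deviation"; summing counts the True entries
count = ((df_test_values["col_values"] - df_reference_values["column2"])
         .abs() < permitted_deviation).sum()
print(count)
```

This relies on the same positional alignment assumption as the one-liner above.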


Here's another variation that performs the comparison and count on each column individually, but all at the same time, and it also makes sure that both original dataframes are sorted by column1, assuming that is suitable as an index for the rows:

df_reference_values.set_index("column1").sort_index() \
    .subtract(df_test_values.set_index("column1").sort_index()["col_values"], axis="index") \
    .abs() \
    .where(lambda x: x < permitted_deviation) \
    .count()

Here's another variant that performs the comparison across all of the columns, so that within each row all the columns must meet your criteria for the row to be counted as "passing":

df_reference_values.set_index("column1").sort_index() \
    .subtract(df_test_values.set_index("column1").sort_index()["col_values"], axis="index") \
    .abs() \
    .where(lambda x: x < permitted_deviation) \
    .dropna() \
    .shape[0]
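To make the two chained variants above concrete, here is an end-to-end sketch on the question's sample data (with two reference columns and an assumed permitted_deviation of 0.15):

```python
import pandas as pd

permitted_deviation = 0.15  # assumed threshold for illustration

df_test_values = pd.DataFrame({"column1": [0, 3, 1],
                               "col_values": [0.99, 0.21, 0.55]})
df_reference_values = pd.DataFrame({"column1": [0, 3, 1],
                                    "column2": [0.80, 0.22, 0.11],
                                    "column3": [0.21, 0.11, 0.33]})

# Row-aligned absolute deviations, shared by both variants
deviations = (df_reference_values.set_index("column1").sort_index()
              .subtract(df_test_values.set_index("column1")
                        .sort_index()["col_values"], axis="index")
              .abs())

# Variant 1: per-column count of values within the deviation
per_column = deviations.where(lambda x: x < permitted_deviation).count()

# Variant 2: number of rows where *every* column is within the deviation
rows_passing = deviations.where(lambda x: x < permitted_deviation).dropna().shape[0]

print(per_column)    # column2 and column3 each have one value within range
print(rows_passing)  # only the column1 == 3 row passes in every column
```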

CodePudding user response:

Not a single-line answer, but one that seems to work...

import pandas as pd
import io
...
df1 = pd.read_csv(io.StringIO("""column1|col_values
0       | 0.99        
3       | 0.21       
1       | 0.55  
"""), delimiter="|",skipinitialspace=True)
df2 = pd.read_csv(io.StringIO("""column1| column2| column3|column4
0       | 0.80     |  0.21   | 0.98
3       | 0.22     |  0.11   | 0.15
1       | 0.11     |  0.33   | 0.5
"""),delimiter="|",skipinitialspace=True)
...
permitted_deviation = 0.1
count = 0

for c in df2.columns[1:]:
    ddf = df2[["column1", c]].merge(df1, on="column1")
    ddf[c + "_within"] = (ddf[c] > ddf.col_values - permitted_deviation) & (ddf[c] < ddf.col_values + permitted_deviation)
    count += ddf[c + "_within"].sum()

(Adapt the > to >= etc. as required.) ddf.columnx_within is a boolean column, and sum()ing it returns the number of True fields. Note that df2.columns[1:] implies the index/cross-reference column is the first column.
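As a side note, the per-column loop can be collapsed into a single vectorized expression by aligning both frames on column1. Here's a sketch on the same sample frames, with an assumed permitted_deviation of 0.15 for illustration (not the original answer's 0.1):

```python
import io
import pandas as pd

permitted_deviation = 0.15  # assumed threshold for illustration

df1 = pd.read_csv(io.StringIO("""column1|col_values
0|0.99
3|0.21
1|0.55
"""), delimiter="|")
df2 = pd.read_csv(io.StringIO("""column1|column2|column3|column4
0|0.80|0.21|0.98
3|0.22|0.11|0.15
1|0.11|0.33|0.5
"""), delimiter="|")

# Align on column1, take absolute row-wise deviations,
# then count every cell (across all columns) within the threshold.
count = (df2.set_index("column1")
         .sub(df1.set_index("column1")["col_values"], axis="index")
         .abs()
         .lt(permitted_deviation)
         .to_numpy()
         .sum())
print(count)
```

This gives the same grand total as summing the loop's per-column counts, without the intermediate merge per column.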
