Comparing pandas dataframe with predefined limits-CodePudding

I have a pandas dataframe with the structure below. It contains results from a comparison done previously, and the column minimum_difference showing which column, for that row, contains the smaller absolute difference in that comparison.

df_test

V       |  A    | B   |  C      |  D    | minimum_difference
-10     |  nan  | nan |  nan    |  nan  | nan
-9.9    |  10   | 1   |  -2200  |  100  | B
-9.8    |  11   | 2   |  -2211  |  1    | D

Also, I have target maximum differences for the smallest difference for each of the value columns (A,B,C,D) like this:

max_difference = pd.Series(dict(
    A=1,
    B=2,
    C=10,
    D=0.5,
))

I want to add a new column to df_test that compares the smallest difference against the target value for that column. For example:

V       |  A    | B   |  C      |  D    | minimum_difference | is_within_max_target
-10     |  nan  | nan |  nan    |  nan  | nan                | nan
-9.9    |  10   | 1   |  -2200  |  100  | B                  | TRUE
-9.8    |  11   | 2   |  -2211  |  1    | D                  | FALSE

Any inputs and ideas are very welcome!

CodePudding user response：

Here is a vectorized solution (fast):

# first: the minimum difference

# we use all names defined in the max_difference Series
cols = max_difference.index.tolist()
z = df_test[cols].abs()
df_test['minimum_difference'] = z.idxmin(axis=1)


# second: whether that difference is <= the corresponding max_difference
i = np.argmin(z.values, axis=1)
df_test['is_within_max_target'] = z.values[np.arange(len(i)), i] <= max_difference.values[i]

Note that, for homogeneity (dtype=bool for the last column), we don't carry NaN over in that column.

CodePudding user response：

One method is to apply a function to each row (non-vectorized; may or may not be fast enough for your full dataset):

def check_diffs(row):
    col = row['minimum_difference']
    if col is np.nan:
        return np.nan
    else:
        return row[col] <= max_difference[col]

df_test['is_within_max_target'] = df_test.apply(check_diffs, axis=1)

print(df_test)
# Output given your example data:
      V     A    B       C      D minimum_difference is_within_max_target
0 -10.0   NaN  NaN     NaN    NaN                NaN                  NaN
1  -9.9  10.0  1.0 -2200.0  100.0                  B                 True
2  -9.8  11.0  2.0 -2211.0    1.0                  D                False

CodePudding user response：

I've tried to make it as simple as possible :

import pandas as pd 
import math
data = [[-10 , math.nan, math.nan, math.nan, math.nan, math.nan],
        [-9.9, 10 , 1 , -2200, 100, 'B'],
        [-9.8, 11 , 2 , -2211, 1, 'D']]
df_test = pd.DataFrame(data, columns = ['V', 'A', 'B' , 'C', 'D', 'minimum_difference'])


max_difference = {
                    'A':1,
                    'B':2,
                    'C':10,
                    'D':0.5}

df_result= pd.DataFrame(columns = ['V', 'A', 'B' , 'C', 'D', 'minimum_difference', 'is_within_max_target'])


for i in range(0, len(df_test)):
               current_row = df_test.iloc[i].tolist()
               current_diff_column = str(current_row[5])
           
               if(current_diff_column!='nan'):
                   current_value = df_test[current_diff_column].iloc[i]
                   if(current_value <= max_difference[current_diff_column]):
                       current_row.append('True')
                       df_result.loc[len(df_result)] = current_row
                   else:
                       current_row.append('False')
                       df_result.loc[len(df_result)] = current_row
               else:
                   current_row.append(math.nan)
                   df_result.loc[len(df_result)] = current_row
           
               
print(df_result)

output:

      V     A    B       C      D minimum_difference is_within_max_target
0 -10.0   NaN  NaN     NaN    NaN                NaN                  NaN
1  -9.9  10.0  1.0 -2200.0  100.0                  B                 True
2  -9.8  11.0  2.0 -2211.0    1.0                  D                False