I have a pandas dataframe with the structure below. It contains results from a comparison done previously, and the column minimum_difference
showing which column, for that row, contains the smaller absolute difference in that comparison.
df_test
V | A | B | C | D | minimum_difference
-10 | nan | nan | nan | nan | nan
-9.9 | 10 | 1 | -2200 | 100 | B
-9.8 | 11 | 2 | -2211 | 1 | D
Also, I have target maximum differences for the smallest difference for each of the value columns (A,B,C,D) like this:
max_difference = pd.Series(dict(
A=1,
B=2,
C=10,
D=0.5,
))
I want to add a new column to df_test that compares the smallest difference against the target value for that column. For example:
V | A | B | C | D | minimum_difference | is_within_max_target
-10 | nan | nan | nan | nan | nan | nan
-9.9 | 10 | 1 | -2200 | 100 | B | TRUE
-9.8 | 11 | 2 | -2211 | 1 | D | FALSE
Any inputs and ideas are very welcome!
CodePudding user response:
Here is a vectorized solution (fast):
# first: the minimum difference
# we use all names defined in the max_difference Series
cols = max_difference.index.tolist()
z = df_test[cols].abs()
df_test['minimum_difference'] = z.idxmin(axis=1)
# second: whether that difference is <= the corresponding max_difference
i = np.argmin(z.values, axis=1)
df_test['is_within_max_target'] = z.values[np.arange(len(i)), i] <= max_difference.values[i]
Note that, for homogeneity (dtype=bool
for the last column), we don't carry NaN
over in that column.
CodePudding user response:
One method is to apply a function to each row (non-vectorized; may or may not be fast enough for your full dataset):
def check_diffs(row):
col = row['minimum_difference']
if col is np.nan:
return np.nan
else:
return row[col] <= max_difference[col]
df_test['is_within_max_target'] = df_test.apply(check_diffs, axis=1)
print(df_test)
# Output given your example data:
V A B C D minimum_difference is_within_max_target
0 -10.0 NaN NaN NaN NaN NaN NaN
1 -9.9 10.0 1.0 -2200.0 100.0 B True
2 -9.8 11.0 2.0 -2211.0 1.0 D False
CodePudding user response:
I've tried to make it as simple as possible :
import pandas as pd
import math
data = [[-10 , math.nan, math.nan, math.nan, math.nan, math.nan],
[-9.9, 10 , 1 , -2200, 100, 'B'],
[-9.8, 11 , 2 , -2211, 1, 'D']]
df_test = pd.DataFrame(data, columns = ['V', 'A', 'B' , 'C', 'D', 'minimum_difference'])
max_difference = {
'A':1,
'B':2,
'C':10,
'D':0.5}
df_result= pd.DataFrame(columns = ['V', 'A', 'B' , 'C', 'D', 'minimum_difference', 'is_within_max_target'])
for i in range(0, len(df_test)):
current_row = df_test.iloc[i].tolist()
current_diff_column = str(current_row[5])
if(current_diff_column!='nan'):
current_value = df_test[current_diff_column].iloc[i]
if(current_value <= max_difference[current_diff_column]):
current_row.append('True')
df_result.loc[len(df_result)] = current_row
else:
current_row.append('False')
df_result.loc[len(df_result)] = current_row
else:
current_row.append(math.nan)
df_result.loc[len(df_result)] = current_row
print(df_result)
output:
V A B C D minimum_difference is_within_max_target
0 -10.0 NaN NaN NaN NaN NaN NaN
1 -9.9 10.0 1.0 -2200.0 100.0 B True
2 -9.8 11.0 2.0 -2211.0 1.0 D False