This is going to be a rather long post to cover all the edge cases and with examples for clarity. A sample of my input data is as below:
df = pd.DataFrame({"Set" : [100, 100, 110, 110, 130, 130, 130, 140, 140, 150, 150, 150, 160, 170, 170],
"measure" : [np.nan, np.nan, 11, 10, np.nan, np.nan, np.nan, np.nan, np.nan, 10, 13, 8, np.nan, 12, 13],
"width" : [0.19, 0.18, 0.2, 0.27, 0.18, 0.17, 0.21, 0.19, 0.16, 0.19, 0.24, 0.3, 0.15, 0.32, 0.28]})
df
Set measure width
0 100 NaN 0.19
1 100 NaN 0.18
2 110 11.0 0.20
3 110 10.0 0.27
4 130 NaN 0.18
5 130 NaN 0.17
6 130 NaN 0.21
7 140 NaN 0.19
8 140 NaN 0.16
9 150 10.0 0.19
10 150 13.0 0.24
11 150 8.0 0.30
12 160 NaN 0.15
13 170 12.0 0.32
14 170 13.0 0.28
I want to be able to fill the NaN base on these conditions in this order of IF-ELSE statement ( I will include example explanation for each condition to be clearer):
- Find the absolute difference between the NaN "Set" and its closest neighbors and fill the NaN with the max value from that closest neighbor - e.g: for index 0 and 1, absolute difference between set 100 and set 110 is 10. Then fill index 0 and 1 NaN with the max value from set 110 which is "11.0" and include the "adjusted width" column which is the width 0.2 for the NaN row at index 0 and 1.
- If the above absolute difference between its closest neighbors is the same for two neighbors, fill NaN with the max value of the "measure" of that set. e.g: for index 4, 5, 6 that is on set "130", the absolute difference between set 110 and set 130 is 20; also the absolute difference between set 150 and set 130 is also 20. There tis has 2 closest neighbors with same difference. To fill the NaN, we will consider which of the two sets has the max value of "measure". In this case, neighbor 'set 150' has the max value of "measure" which is "13.0". Then fill index 4, 5 and 6 NaN with the max value from set 150 which is "13.0" and include the "adjusted width" column which is the width 0.24 for the NaN row at index 4, 5, 6. (Note, you cannot fill set 150 with that of set 140 eventhough it is closest because 140 is NaN in the first place and was initially not provided).
- If condition 1 and 2 above are the same, fill NaN with the max value of the "width" of that set. e.g for index 12, that is on set 160, its closest neighbors are set 150 and set 170 each with absolute difference of 10, meaning condition 1 is a tie, then proceeding to condition 2, the max value of "measure" for both set 150 and set 170 is "13.0". To fill the NaN, we will consider which of the two sets has the max value of "width". In this case, neighbor 'set 170' has the max value of "width" which is "0.28". Then fill index 12 NaN with the max value of "measure" corresponding to the row with the max value of "width" from set 170 which is "13.0" and include the "adjusted width" column which is the width 0.28 for the NaN row at index 14.
Note Set 140 NaN satisfies condition 1 with set 150 being its closest neighbor and is filled with 13.0 from set 150 and adjusted width of 0.24 from set 150.
My final output table will be as below:
Set measure width Adjusted Width
0 100 11.0 0.19 0.2
1 100 11.0 0.18 0.2
2 110 11.0 0.20 0.2
3 110 10.0 0.27 0.27
4 130 13.0 0.18 0.24
5 130 13.0 0.17 0.24
6 130 13.0 0.21 0.24
7 140 13.0 0.19 0.24
8 140 13.0 0.16 0.24
9 150 10.0 0.19 0.19
10 150 13.0 0.24 0.24
11 150 8.0 0.30 0.30
12 160 13.0 0.15 0.28
13 170 12.0 0.32 0.32
14 170 13.0 0.28 0.28
I know this is long, I needed to cover all cases, please leave me a comment if you need further clarifications. Thanks in advance.
CodePudding user response:
There you go. Comments will go a long way helping you understand the flow, but more or less translated your written logic into code. Also, added an additional condition_met
column to help you see which condition was met for the different cases. This can be optimised for sure, but it will certainly provide a solid starting point.
The crux of it, if there is one, is to filter the original dataframe at every step according to your conditions and run checks against it.
The print
statements in the else
blocks are just in case something goes wrong in real life - but they should not be executed.
import pandas as pd
import numpy as np
df = pd.DataFrame({"set" : [100, 100, 110, 110, 130, 130, 130, 140, 140, 150, 150, 150, 160, 170, 170],
"measure" : [np.nan, np.nan, 11, 10, np.nan, np.nan, np.nan, np.nan, np.nan, 10, 13, 8, np.nan, 12, 13],
"width" : [0.19, 0.18, 0.2, 0.27, 0.18, 0.17, 0.21, 0.19, 0.16, 0.19, 0.24, 0.3, 0.15, 0.32, 0.28]})
# we will create a copy of the output df to write the results
# otherwise when we write the original df, the conditions are change
# e.g. when we get to checking set 140, set 130 is resolved, meaning that
# set 140 will now meet condition 3, and not condition 1 as desired
output_df = df.copy()
output_df['adjusted_width'] = output_df['width']
# we will just loop over unique sets instead of the dataframe for some efficiency
for set_ind in df['set'].unique():
# here we check that all measures are NaN but that might be conservative
if all(np.isnan(df[df['set']==set_ind]['measure'])):
# making sure that we remove nan values from measure when we check for the minimum or when we find the neighbours
# though that can be extracted in a separate step
distance_from_closest_neighbour = min(np.abs(df[(df['set']!=set_ind) & (~np.isnan(df['measure']))]['set']-set_ind))
only_closest_neighbours_df = df[(np.abs(df['set']-set_ind)==distance_from_closest_neighbour) & (~np.isnan(df['measure']))]
# if there is only one closest neighbour, then we have met condition 1
if only_closest_neighbours_df['set'].unique().shape[0] == 1:
# executing condition 1: get the max measure from the df, the width and set them on the output df
max_measure_from_neighbour = max(only_closest_neighbours_df['measure'])
width_of_max_measure_from_neighbour = only_closest_neighbours_df[only_closest_neighbours_df['measure']==max_measure_from_neighbour]['width'].iloc[0]
output_df.loc[output_df['set'] == set_ind, 'measure'] = max_measure_from_neighbour
output_df.loc[output_df['set'] == set_ind, 'adjusted_width'] = width_of_max_measure_from_neighbour
output_df.loc[output_df['set'] == set_ind, 'condition_met'] = 'condition 1'
# condition 1 not met - must be 2 closest neighbours
elif only_closest_neighbours_df['set'].unique().shape[0] == 2:
# get the max measure of all closest neighbours and check how many neighbours have that value in measure
max_measure_from_all_neighbours = max(only_closest_neighbours_df['measure'])
closest_neighbours_with_max_measure = only_closest_neighbours_df[only_closest_neighbours_df['measure']==max_measure_from_all_neighbours]
# if only 1 of the closest neighbours has the max value of measure, then we have met condition 2
if closest_neighbours_with_max_measure['set'].unique().shape[0] == 1:
# executing condition 2: set the max measure and the width on the output df
width_of_max_measure_from_neighbour_for_cond_2 = closest_neighbours_with_max_measure[closest_neighbours_with_max_measure['measure']==max_measure_from_all_neighbours]['width'].iloc[0]
output_df.loc[output_df['set'] == set_ind, 'measure'] = max_measure_from_all_neighbours
output_df.loc[output_df['set'] == set_ind, 'adjusted_width'] = width_of_max_measure_from_neighbour_for_cond_2
output_df.loc[output_df['set'] == set_ind, 'condition_met'] = 'condition 2'
# if both closest neighbours have the max value of measure, then we have met condition 3
elif closest_neighbours_with_max_measure['set'].unique().shape[0] == 2:
# executing condition 3: set the max measure and the width of the output df
closest_neighbours_with_max_measure_and_max_width = closest_neighbours_with_max_measure.sort_values('width', ascending=False).iloc[0]
output_df.loc[output_df['set'] == set_ind, 'measure'] = closest_neighbours_with_max_measure_and_max_width['measure']
output_df.loc[output_df['set'] == set_ind, 'adjusted_width'] = closest_neighbours_with_max_measure_and_max_width['width']
output_df.loc[output_df['set'] == set_ind, 'condition_met'] = 'condition 3'
else:
print(f'Something went wrong - {set_ind}')
else:
print(f'Something went wrong 2 - {set_ind}')
the output then is:
set measure width adjusted_width condition_met
0 100 11.0 0.19 0.20 condition 1
1 100 11.0 0.18 0.20 condition 1
2 110 11.0 0.20 0.20 NaN
3 110 10.0 0.27 0.27 NaN
4 130 13.0 0.18 0.24 condition 2
5 130 13.0 0.17 0.24 condition 2
6 130 13.0 0.21 0.24 condition 2
7 140 13.0 0.19 0.24 condition 1
8 140 13.0 0.16 0.24 condition 1
9 150 10.0 0.19 0.19 NaN
10 150 13.0 0.24 0.24 NaN
11 150 8.0 0.30 0.30 NaN
12 160 13.0 0.15 0.28 condition 3
13 170 12.0 0.32 0.32 NaN
14 170 13.0 0.28 0.28 NaN