Fill NaN base on several 'IFS' conditions-CodePudding

This is going to be a rather long post to cover all the edge cases and with examples for clarity. A sample of my input data is as below:

df = pd.DataFrame({"Set" : [100, 100, 110, 110, 130, 130, 130, 140, 140, 150, 150, 150, 160, 170, 170],
                  "measure" : [np.nan, np.nan, 11, 10, np.nan, np.nan, np.nan, np.nan, np.nan, 10, 13, 8, np.nan, 12, 13],
                  "width" : [0.19, 0.18, 0.2, 0.27, 0.18, 0.17, 0.21, 0.19, 0.16, 0.19, 0.24, 0.3, 0.15, 0.32, 0.28]})
df

    Set  measure     width
0   100  NaN         0.19
1   100  NaN         0.18
2   110  11.0        0.20
3   110  10.0        0.27
4   130  NaN         0.18
5   130  NaN         0.17
6   130  NaN         0.21
7   140  NaN         0.19
8   140  NaN         0.16
9   150  10.0        0.19
10  150  13.0        0.24
11  150  8.0         0.30
12  160  NaN         0.15
13  170  12.0        0.32
14  170  13.0        0.28

I want to be able to fill the NaN base on these conditions in this order of IF-ELSE statement ( I will include example explanation for each condition to be clearer):

Find the absolute difference between the NaN "Set" and its closest neighbors and fill the NaN with the max value from that closest neighbor - e.g: for index 0 and 1, absolute difference between set 100 and set 110 is 10. Then fill index 0 and 1 NaN with the max value from set 110 which is "11.0" and include the "adjusted width" column which is the width 0.2 for the NaN row at index 0 and 1.
If the above absolute difference between its closest neighbors is the same for two neighbors, fill NaN with the max value of the "measure" of that set. e.g: for index 4, 5, 6 that is on set "130", the absolute difference between set 110 and set 130 is 20; also the absolute difference between set 150 and set 130 is also 20. There tis has 2 closest neighbors with same difference. To fill the NaN, we will consider which of the two sets has the max value of "measure". In this case, neighbor 'set 150' has the max value of "measure" which is "13.0". Then fill index 4, 5 and 6 NaN with the max value from set 150 which is "13.0" and include the "adjusted width" column which is the width 0.24 for the NaN row at index 4, 5, 6. (Note, you cannot fill set 150 with that of set 140 eventhough it is closest because 140 is NaN in the first place and was initially not provided).
If condition 1 and 2 above are the same, fill NaN with the max value of the "width" of that set. e.g for index 12, that is on set 160, its closest neighbors are set 150 and set 170 each with absolute difference of 10, meaning condition 1 is a tie, then proceeding to condition 2, the max value of "measure" for both set 150 and set 170 is "13.0". To fill the NaN, we will consider which of the two sets has the max value of "width". In this case, neighbor 'set 170' has the max value of "width" which is "0.28". Then fill index 12 NaN with the max value of "measure" corresponding to the row with the max value of "width" from set 170 which is "13.0" and include the "adjusted width" column which is the width 0.28 for the NaN row at index 14.

Note Set 140 NaN satisfies condition 1 with set 150 being its closest neighbor and is filled with 13.0 from set 150 and adjusted width of 0.24 from set 150.

My final output table will be as below:


    Set  measure     width   Adjusted Width
0   100  11.0        0.19    0.2
1   100  11.0        0.18    0.2
2   110  11.0        0.20    0.2
3   110  10.0        0.27    0.27
4   130  13.0        0.18    0.24
5   130  13.0        0.17    0.24
6   130  13.0        0.21    0.24
7   140  13.0        0.19    0.24
8   140  13.0        0.16    0.24
9   150  10.0        0.19    0.19
10  150  13.0        0.24    0.24
11  150  8.0         0.30    0.30
12  160  13.0        0.15    0.28
13  170  12.0        0.32    0.32
14  170  13.0        0.28    0.28

I know this is long, I needed to cover all cases, please leave me a comment if you need further clarifications. Thanks in advance.

CodePudding user response：

There you go. Comments will go a long way helping you understand the flow, but more or less translated your written logic into code. Also, added an additional condition_met column to help you see which condition was met for the different cases. This can be optimised for sure, but it will certainly provide a solid starting point.

The crux of it, if there is one, is to filter the original dataframe at every step according to your conditions and run checks against it.

The print statements in the else blocks are just in case something goes wrong in real life - but they should not be executed.

import pandas as pd
import numpy as np

df = pd.DataFrame({"set" : [100, 100, 110, 110, 130, 130, 130, 140, 140, 150, 150, 150, 160, 170, 170],
                  "measure" : [np.nan, np.nan, 11, 10, np.nan, np.nan, np.nan, np.nan, np.nan, 10, 13, 8, np.nan, 12, 13],
                  "width" : [0.19, 0.18, 0.2, 0.27, 0.18, 0.17, 0.21, 0.19, 0.16, 0.19, 0.24, 0.3, 0.15, 0.32, 0.28]})

# we will create a copy of the output df to write the results
# otherwise when we write the original df, the conditions are change
# e.g. when we get to checking set 140, set 130 is resolved, meaning that
# set 140 will now meet condition 3, and not condition 1 as desired
output_df = df.copy()
output_df['adjusted_width'] = output_df['width']

# we will just loop over unique sets instead of the dataframe for some efficiency
for set_ind in df['set'].unique():
    # here we check that all measures are NaN but that might be conservative
    if all(np.isnan(df[df['set']==set_ind]['measure'])):
        # making sure that we remove nan values from measure when we check for the minimum or when we find the neighbours
        # though that can be extracted in a separate step
        distance_from_closest_neighbour = min(np.abs(df[(df['set']!=set_ind) & (~np.isnan(df['measure']))]['set']-set_ind))
        only_closest_neighbours_df = df[(np.abs(df['set']-set_ind)==distance_from_closest_neighbour) & (~np.isnan(df['measure']))]
        
        # if there is only one closest neighbour, then we have met condition 1
        if only_closest_neighbours_df['set'].unique().shape[0] == 1:
            # executing condition 1: get the max measure from the df, the width and set them on the output df
            max_measure_from_neighbour = max(only_closest_neighbours_df['measure'])
            width_of_max_measure_from_neighbour = only_closest_neighbours_df[only_closest_neighbours_df['measure']==max_measure_from_neighbour]['width'].iloc[0]
            
            output_df.loc[output_df['set'] == set_ind, 'measure'] = max_measure_from_neighbour
            output_df.loc[output_df['set'] == set_ind, 'adjusted_width'] = width_of_max_measure_from_neighbour
            output_df.loc[output_df['set'] == set_ind, 'condition_met'] = 'condition 1'
        
        # condition 1 not met - must be 2 closest neighbours
        elif only_closest_neighbours_df['set'].unique().shape[0] == 2:
            # get the max measure of all closest neighbours and check how many neighbours have that value in measure
            max_measure_from_all_neighbours = max(only_closest_neighbours_df['measure'])
            closest_neighbours_with_max_measure = only_closest_neighbours_df[only_closest_neighbours_df['measure']==max_measure_from_all_neighbours]

            # if only 1 of the closest neighbours has the max value of measure, then we have met condition 2
            if closest_neighbours_with_max_measure['set'].unique().shape[0] == 1:
                # executing condition 2: set the max measure and the width on the output df
                width_of_max_measure_from_neighbour_for_cond_2 = closest_neighbours_with_max_measure[closest_neighbours_with_max_measure['measure']==max_measure_from_all_neighbours]['width'].iloc[0]
                output_df.loc[output_df['set'] == set_ind, 'measure'] = max_measure_from_all_neighbours
                output_df.loc[output_df['set'] == set_ind, 'adjusted_width'] = width_of_max_measure_from_neighbour_for_cond_2
                output_df.loc[output_df['set'] == set_ind, 'condition_met'] = 'condition 2'
            
            # if both closest neighbours have the max value of measure, then we have met condition 3
            elif closest_neighbours_with_max_measure['set'].unique().shape[0] == 2:
                # executing condition 3: set the max measure and the width of the output df
                closest_neighbours_with_max_measure_and_max_width = closest_neighbours_with_max_measure.sort_values('width', ascending=False).iloc[0]
                output_df.loc[output_df['set'] == set_ind, 'measure'] = closest_neighbours_with_max_measure_and_max_width['measure']
                output_df.loc[output_df['set'] == set_ind, 'adjusted_width'] = closest_neighbours_with_max_measure_and_max_width['width']
                output_df.loc[output_df['set'] == set_ind, 'condition_met'] = 'condition 3'
            else:
                print(f'Something went wrong - {set_ind}')
        else:
            print(f'Something went wrong 2 - {set_ind}')

the output then is:


    set     measure     width   adjusted_width  condition_met
0   100     11.0    0.19    0.20    condition 1
1   100     11.0    0.18    0.20    condition 1
2   110     11.0    0.20    0.20    NaN
3   110     10.0    0.27    0.27    NaN
4   130     13.0    0.18    0.24    condition 2
5   130     13.0    0.17    0.24    condition 2
6   130     13.0    0.21    0.24    condition 2
7   140     13.0    0.19    0.24    condition 1
8   140     13.0    0.16    0.24    condition 1
9   150     10.0    0.19    0.19    NaN
10  150     13.0    0.24    0.24    NaN
11  150     8.0     0.30    0.30    NaN
12  160     13.0    0.15    0.28    condition 3
13  170     12.0    0.32    0.32    NaN
14  170     13.0    0.28    0.28    NaN