Pandas fill na in column based on values in multiple columns-CodePudding

I have a dataframe as follows:

   model         year   price
                              
   Honda          2016    5.0
   Yamaha         2017    5.5
   Honda          2018    5.7
   Yamaha         2017    NaN
   Bajaj          2015    6.5
   Bajaj          2015    NaN
   Honda          2019    7.0
   Honda          2017    NaN

All I want to do is fill in the NaN in the price column based on model and year columns. For example, Yamaha 2017 is 5.5 in row 2, in row 4 it is NaN. The NaN should be replaced with 5.5.

This is just a sample dataset I have more than 1000 rows. The next step would be to fill in the value for the model based on the nearest year, only in cases where the model and year combination are not present in the dataset (I am not sure if I have any such case in my dataset hence this is not priority for now).

Any help in the correct direction is appreciated. Thanks.

CodePudding user response：

In case the model, year combinations do not exists in the data set we can make use of pd.DataFrame.merge.asof as suggested by dear mozway who is always a source of inspiration as follows:

df['price'] = df.groupby(['model', 'year'], group_keys=False)['price'].apply(lambda x: x.ffill().bfill())

# First we sort the original data set by year values
df.sort_values('year', inplace=True)

# Then we merge the original data set with a subset of the original 
# which contains no NaN values. Just note that I specified a tuple of 
# values in suffixes argument to distinguish between the columns whose 
# names are overlapped

df = (pd.merge_asof(df, df.loc[~ df.price.isnull()], on='year', direction='nearest', suffixes=('_x', ''))
      .loc[:, ['model_x', 'year', 'price']])

df.columns = df.columns.str.rstrip('_x')

    model  year  price
0   Bajaj  2015    6.5
1   Bajaj  2015    6.5
2   Honda  2016    5.0
3  Yamaha  2017    5.5
4  Yamaha  2017    5.5
5   Honda  2018    5.7
6  Yamaha  2018    5.7
7   Honda  2019    7.0
8   Honda  2020    7.0

I have created the following sample data for this case:

testdata = ''' model         year   price                          
   Honda          2016    5.0
   Yamaha         2017    5.5
   Honda          2018    5.7
   Yamaha         2017    NaN
   Yamaha         2018    NaN
   Bajaj          2015    6.5
   Bajaj          2015    NaN
   Honda          2019    7.0
   Honda          2020    NaN
'''