I have a dataframe as follows:
model year price
Honda 2016 5.0
Yamaha 2017 5.5
Honda 2018 5.7
Yamaha 2017 NaN
Bajaj 2015 6.5
Bajaj 2015 NaN
Honda 2019 7.0
Honda 2017 NaN
All I want to do is fill in the NaN in the price column based on model and year columns. For example, Yamaha 2017 is 5.5 in row 2, in row 4 it is NaN. The NaN should be replaced with 5.5.
This is just a sample dataset I have more than 1000 rows. The next step would be to fill in the value for the model based on the nearest year, only in cases where the model and year combination are not present in the dataset (I am not sure if I have any such case in my dataset hence this is not priority for now).
Any help in the correct direction is appreciated. Thanks.
CodePudding user response:
In case the model, year combinations do not exists in the data set we can make use of pd.DataFrame.merge.asof as suggested by dear mozway who is always a source of inspiration as follows:
df['price'] = df.groupby(['model', 'year'], group_keys=False)['price'].apply(lambda x: x.ffill().bfill())
# First we sort the original data set by year values
df.sort_values('year', inplace=True)
# Then we merge the original data set with a subset of the original
# which contains no NaN values. Just note that I specified a tuple of
# values in suffixes argument to distinguish between the columns whose
# names are overlapped
df = (pd.merge_asof(df, df.loc[~ df.price.isnull()], on='year', direction='nearest', suffixes=('_x', ''))
.loc[:, ['model_x', 'year', 'price']])
df.columns = df.columns.str.rstrip('_x')
model year price
0 Bajaj 2015 6.5
1 Bajaj 2015 6.5
2 Honda 2016 5.0
3 Yamaha 2017 5.5
4 Yamaha 2017 5.5
5 Honda 2018 5.7
6 Yamaha 2018 5.7
7 Honda 2019 7.0
8 Honda 2020 7.0
I have created the following sample data for this case:
testdata = ''' model year price
Honda 2016 5.0
Yamaha 2017 5.5
Honda 2018 5.7
Yamaha 2017 NaN
Yamaha 2018 NaN
Bajaj 2015 6.5
Bajaj 2015 NaN
Honda 2019 7.0
Honda 2020 NaN
'''