I've a dataframe like as follows,
import pandas as pd
data = {
'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
}
df = pd.DataFrame(data)
print(df)
brand model year price
0 Mercedes X 2011 NaN
1 Renault Y 2010 1000.4
2 Ford Z 2009 2000.3
3 Mercedes X 2010 1000.0
4 Mercedes X 2012 1100.3
5 Mercedes X 2020 3000.5
6 Renault Q 2011 NaN
I want to fill the missing values with the average of the observations containing year-1, year and year 1 and also same brand and model. For instance, Mercedes X model has a null price in 2011. When I look at the data,
2011 - 1 = 2010
2011 1 = 2012
The 4th observation -> Mercedes,X,2010,1000.0
The 5th observation -> Mercedes,X,2012,1100.3
The mean -> (1000.0 1100.3) / 2 = 1050.15
I've tried something as follows,
for c_key, _ in df.groupby(['brand', 'model', 'year']):
fc = (
(df['brand'] == c_key[0])
& (df['model'] == c_key[1])
& (df['year'].isin([c_key[2] 1, c_key[2], c_key[2] - 1]))
)
sc = (
(df['brand'] == c_key[0])
& (df['model'] == c_key[1])
& (df['year'] == c_key[2])
& (df['price'].isnull())
)
mean_val = df[fc]['price'].mean()
df.loc[sc, 'price'] = mean_val
print(df)
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
But this solution takes a long time for 90,000 rows and 27 columns so, is there a more effective solution? For instance, can I use groupby
for the values year-1, year, year 1, brand and model?
Thanks in advance.
CodePudding user response:
I think actually a more efficient way would be to sort by Brand
and then Year
, and then use interpolate
:
df = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))
Output:
>>> df
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 1000.40
That also handles all the columns.