I have a DataFrame which looks like this:
car release_year class
1. BMW 5 2017 standard
2. Ford Mondeo 2016 standard
3. Mercedes S 2018 premium
4. Volvo V60X90 2017 premium
5. Opel Astra NaN standard
6. VW Passat 2014 standard
7. Lexus LF-1 NaN premium
I need to fill in the NaN values with the average release year of the same class. So first I need to calculate the average release year separately for standard and for premium cars, and then put that average into the table wherever there is currently a NaN.
CodePudding user response:
Starting with
import pandas as pd
import numpy as np
df = pd.DataFrame([
[1, "BMW 5", 2017, "standard"],
[2, "Ford Mondeo", 2016, "standard"],
[3, "Mercedes S", 2018, "premium"],
[4, "Volvo V60X90", 2017, "premium"],
[5, "Opel Astra", np.nan, "standard"],
[6, "VW Passat", 2014, "standard"],
[7, "Lexus LF - 1", np.nan, "premium"],
], columns=["index", "car", "release_year", "class"]).set_index("index")
Group by class, select release_year, then use .transform("mean") to get a Series, aligned with the original index, in which every row holds its group's mean.
mean = df.groupby("class")["release_year"].transform("mean")
mean
>>
index
1 2015.666667
2 2015.666667
3 2017.500000
4 2017.500000
5 2015.666667
6 2015.666667
7 2017.500000
Name: release_year, dtype: float64
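Then replace only the NaN values in release_year. The assignment aligns on the index, so each missing row receives its own group's rounded mean.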
df.loc[df["release_year"].isna(), "release_year"] = mean.round(0)
df
>>
car release_year class
index
1 BMW 5 2017.0 standard
2 Ford Mondeo 2016.0 standard
3 Mercedes S 2018.0 premium
4 Volvo V60X90 2017.0 premium
5 Opel Astra 2016.0 standard
6 VW Passat 2014.0 standard
7 Lexus LF - 1 2018.0 premium
I chose to round the average years, but you could use np.ceil or np.floor instead.
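As a rough sketch of the np.floor / np.ceil alternative mentioned above (the names group_mean, filled_floor and filled_ceil are only illustrative, and the DataFrame is rebuilt here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Same data as in the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "car": ["BMW 5", "Ford Mondeo", "Mercedes S", "Volvo V60X90",
            "Opel Astra", "VW Passat", "Lexus LF-1"],
    "release_year": [2017, 2016, 2018, 2017, np.nan, 2014, np.nan],
    "class": ["standard", "standard", "premium", "premium",
              "standard", "standard", "premium"],
})

# Per-row mean of the row's own class (NaN values are skipped by mean).
group_mean = df.groupby("class")["release_year"].transform("mean")

# Round the group mean down or up before using it to fill the gaps.
filled_floor = df["release_year"].fillna(np.floor(group_mean))
filled_ceil = df["release_year"].fillna(np.ceil(group_mean))

print(filled_floor.tolist())
print(filled_ceil.tolist())
```

Which of the three you pick matters mostly for means that land exactly on .5, such as the premium class here (2017.5): rounding in pandas follows NumPy and goes to the nearest even value, while floor and ceil always go in one direction.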
CodePudding user response:
Considering your literal description, one could use:
df['release_year'] = (df['release_year']
.fillna(df.groupby('class')['release_year']
.transform('mean'))
)
But this gives you decimal years. Is this really what you want?
Output:
car release_year class
1 BMW 5 2017.000000 standard
2 Ford Mondeo 2016.000000 standard
3 Mercedes S 2018.000000 premium
4 Volvo V60X90 2017.000000 premium
5 Opel Astra 2015.666667 standard
6 VW Passat 2014.000000 standard
7 Lexus LF-1 2017.500000 premium
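If whole years are preferred, one possible variation is to round the filled column and cast it to an integer dtype. This is only a sketch: the choice of the nullable "Int64" dtype is an assumption on my part (it tolerates any NaN that might remain, for example if a whole class had no known year), and a plain int cast would also work for this data.

```python
import numpy as np
import pandas as pd

# Same data as in the question, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "car": ["BMW 5", "Ford Mondeo", "Mercedes S", "Volvo V60X90",
            "Opel Astra", "VW Passat", "Lexus LF-1"],
    "release_year": [2017, 2016, 2018, 2017, np.nan, 2014, np.nan],
    "class": ["standard", "standard", "premium", "premium",
              "standard", "standard", "premium"],
})

# Fill each NaN with the mean of its class, as in the answer above.
filled = df["release_year"].fillna(
    df.groupby("class")["release_year"].transform("mean")
)

# Round to whole years, then switch to pandas' nullable integer dtype.
df["release_year"] = filled.round().astype("Int64")

print(df)
```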
CodePudding user response:
import pandas as pd
import numpy as np
df = pd.DataFrame([
[1, "BMW 5", 2017, "standard"],
[2, "Ford Mondeo", 2016, "standard"],
[3, "Mercedes S", 2018, "premium"],
[4, "Volvo V60X90", 2017, "premium"],
[5, "Opel Astra", np.nan, "standard"],
[6, "VW Passat", 2014, "standard"],
[7, "Lexus LF - 1", np.nan, "premium"],
], columns=["index", "car", "release_year", "class"]).set_index("index")
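To get the mean of the release_year column you can use the following. Note that this is the mean over all cars, not the per-class mean the question asks for: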
mean_df = df['release_year'].mean()
print(int(mean_df))
#2016
Then select the rows where release_year is NaN with .isna() and assign the mean to them. (numpy is only needed here to write np.nan when building the DataFrame; .isna() is a pandas method.)
df.loc[df["release_year"].isna(), "release_year"] = int(mean_df)
df
car release_year class
index
1 BMW 5 2017.0 standard
2 Ford Mondeo 2016.0 standard
3 Mercedes S 2018.0 premium
4 Volvo V60X90 2017.0 premium
5 Opel Astra 2016.0 standard
6 VW Passat 2014.0 standard
7 Lexus LF - 1 2016.0 premium
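Since the code above fills both gaps with the overall mean (2016), here is a hedged sketch of how the same step-by-step style could be adapted to per-class means, which is what the question describes. The name class_means is just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, "BMW 5", 2017, "standard"],
    [2, "Ford Mondeo", 2016, "standard"],
    [3, "Mercedes S", 2018, "premium"],
    [4, "Volvo V60X90", 2017, "premium"],
    [5, "Opel Astra", np.nan, "standard"],
    [6, "VW Passat", 2014, "standard"],
    [7, "Lexus LF - 1", np.nan, "premium"],
], columns=["index", "car", "release_year", "class"]).set_index("index")

# Step 1: one mean per class (a small Series indexed by class name).
class_means = df.groupby("class")["release_year"].mean()
print(class_means)

# Step 2: look up each row's class mean and use it only where the year is NaN.
df["release_year"] = df["release_year"].fillna(df["class"].map(class_means))
print(df)
```

This leaves decimal values (about 2015.67 for standard, 2017.5 for premium); wrap the mapped means in .round() if whole years are needed.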