Python/Pandas: change NaN to values based on average from different column

I have a DataFrame that looks like this:

     car   release_year   class    
     1. BMW 5   2017   standard
     2. Ford Mondeo   2016   standard
     3. Mercedes S   2018   premium
     4. Volvo V60X90   2017   premium
     5. Opel Astra   NaN   standard
     6. VW Passat   2014    standard
     7. Lexus LF-1   NaN   premium

I need to fill each NaN with the average release year of the same class. So the first step is to calculate the average release year separately for standard and for premium cars, and then put that average into the rows that currently hold NaN.

CodePudding user response：

Starting with

import pandas as pd
import numpy as np
df = pd.DataFrame([
    [1, "BMW 5", 2017,   "standard"],
    [2, "Ford Mondeo", 2016,   "standard"],
    [3, "Mercedes S", 2018,   "premium"],
    [4, "Volvo V60X90", 2017,   "premium"],
    [5, "Opel Astra", np.nan,   "standard"],
    [6, "VW Passat", 2014 ,   "standard"],
    [7, "Lexus LF - 1", np.nan,   "premium"],
], columns= ["index", "car", "release_year", "class"]).set_index("index")

Group by class, select release_year, then use .transform("mean") to get, for every row, the mean of that row's group.

mean = df.groupby("class")["release_year"].transform("mean")
mean
>> 
index
1    2015.666667
2    2015.666667
3    2017.500000
4    2017.500000
5    2015.666667
6    2015.666667
7    2017.500000
Name: release_year, dtype: float64
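
In case it helps to see why transform is used here instead of a plain aggregation, below is a minimal sketch that rebuilds the same sample data (as a dict, which is only a convenience on my part) and compares the two. The names per_class and per_row are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own.
df = pd.DataFrame(
    {
        "car": ["BMW 5", "Ford Mondeo", "Mercedes S", "Volvo V60X90",
                "Opel Astra", "VW Passat", "Lexus LF - 1"],
        "release_year": [2017, 2016, 2018, 2017, np.nan, 2014, np.nan],
        "class": ["standard", "standard", "premium", "premium",
                  "standard", "standard", "premium"],
    },
    index=pd.Index(range(1, 8), name="index"),
)

# Aggregation: one value per class (NaN entries are skipped by default).
per_class = df.groupby("class")["release_year"].mean()

# Transform: the same means, repeated so there is one value per original row.
per_row = df.groupby("class")["release_year"].transform("mean")

print(per_class.shape)  # (2,)
print(per_row.shape)    # (7,)
```

Because per_row keeps the original index, it can be lined up with df row by row, which is what the next step relies on.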

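Then replace the NaN values in release_year. The assignment aligns on the index, so each selected row receives the rounded mean of its own class.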

df.loc[df["release_year"].isna(), "release_year"] = mean.round(0)
df
>>
                car  release_year     class
index                                      
1             BMW 5        2017.0  standard
2       Ford Mondeo        2016.0  standard
3        Mercedes S        2018.0   premium
4      Volvo V60X90        2017.0   premium
5        Opel Astra        2016.0  standard
6         VW Passat        2014.0  standard
7      Lexus LF - 1        2018.0   premium

I chose to round the average years, but you could use np.ceil or np.floor instead.
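
To make the difference between the three options concrete, here is a small sketch applied to the two class means from above. The Series is typed in by hand (2015.666667 is the printed, truncated value), and the names means and summary are just for illustration.

```python
import numpy as np
import pandas as pd

# The two class means computed earlier, entered by hand for this sketch.
means = pd.Series([2015.666667, 2017.5], index=["standard", "premium"])

rounded = means.round(0)   # nearest integer; an exact .5 tie goes to the even value
floored = np.floor(means)  # always down
ceiled = np.ceil(means)    # always up

# Collect the results side by side as integer years.
summary = pd.DataFrame(
    {"round": rounded, "floor": floored, "ceil": ceiled}
).astype(int)
print(summary)
```

Note that the premium mean of 2017.5 rounds to 2018 only because 2018 happens to be even; a mean of 2016.5 would round down to 2016.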

CodePudding user response：

Considering your literal description, one could use:

df['release_year'] = (df['release_year']
                     .fillna(df.groupby('class')['release_year']
                               .transform('mean'))
                     )

But this gives you decimal years... Is this really what you want?

Output:

            car  release_year     class
1         BMW 5   2017.000000  standard
2   Ford Mondeo   2016.000000  standard
3    Mercedes S   2018.000000   premium
4  Volvo V60X90   2017.000000   premium
5    Opel Astra   2015.666667  standard
6     VW Passat   2014.000000  standard
7    Lexus LF-1   2017.500000   premium
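
If whole years are wanted, one possible variation (a sketch, not the only way) is to round the group means before filling and then cast the column to int. The sample frame is rebuilt here so the snippet runs by itself; class_mean is a name chosen for this example.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt so this snippet is self-contained.
df = pd.DataFrame(
    {
        "car": ["BMW 5", "Ford Mondeo", "Mercedes S", "Volvo V60X90",
                "Opel Astra", "VW Passat", "Lexus LF-1"],
        "release_year": [2017, 2016, 2018, 2017, np.nan, 2014, np.nan],
        "class": ["standard", "standard", "premium", "premium",
                  "standard", "standard", "premium"],
    },
    index=range(1, 8),
)

# Per-row group mean, rounded to the nearest whole year.
class_mean = df.groupby("class")["release_year"].transform("mean").round()

# Fill the gaps, then cast to int (safe here because no NaN is left).
df["release_year"] = df["release_year"].fillna(class_mean).astype(int)
print(df)
```

The cast assumes every class has at least one known year. If a class contained only NaN, its mean would also be NaN and astype(int) would raise an error.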

CodePudding user response：

import pandas as pd
import numpy as np
df = pd.DataFrame([
    [1, "BMW 5", 2017,   "standard"],
    [2, "Ford Mondeo", 2016,   "standard"],
    [3, "Mercedes S", 2018,   "premium"],
    [4, "Volvo V60X90", 2017,   "premium"],
    [5, "Opel Astra", np.nan,   "standard"],
    [6, "VW Passat", 2014 ,   "standard"],
    [7, "Lexus LF - 1", np.nan,   "premium"],
], columns= ["index", "car", "release_year", "class"]).set_index("index")

To get the mean of the whole release_year column (across both classes), you can use:

mean_df = df['release_year'].mean()
print(int(mean_df))
#2016

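Then select the rows where release_year is NaN with .isna() and assign the mean to them. No explicit comparison against np.nan is needed for this step; numpy is only imported to write np.nan when building the DataFrame. Note that this fills every gap with the overall mean, not the per-class mean the question asks for.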

df.loc[df["release_year"].isna(), "release_year"] = int(mean_df)
df

                car  release_year     class
index                                      
1             BMW 5        2017.0  standard
2       Ford Mondeo        2016.0  standard
3        Mercedes S        2018.0   premium
4      Volvo V60X90        2017.0   premium
5        Opel Astra        2016.0  standard
6         VW Passat        2014.0  standard
7      Lexus LF - 1        2016.0   premium
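
Since the question asks for the average within each class, here is a hedged sketch of how the same .loc assignment could use per-class means instead, by looking up each missing row's class in a small table of means. The names class_means and missing are made up for this example.

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own.
df = pd.DataFrame([
    [1, "BMW 5", 2017, "standard"],
    [2, "Ford Mondeo", 2016, "standard"],
    [3, "Mercedes S", 2018, "premium"],
    [4, "Volvo V60X90", 2017, "premium"],
    [5, "Opel Astra", np.nan, "standard"],
    [6, "VW Passat", 2014, "standard"],
    [7, "Lexus LF - 1", np.nan, "premium"],
], columns=["index", "car", "release_year", "class"]).set_index("index")

# One mean per class (NaN entries are skipped by default).
class_means = df.groupby("class")["release_year"].mean()

# Boolean mask of the rows that need filling.
missing = df["release_year"].isna()

# Look up each missing row's class mean and round it to a whole year.
df.loc[missing, "release_year"] = df.loc[missing, "class"].map(class_means).round()
print(df)
```

With this version the Lexus row gets 2018.0 (the rounded premium mean) rather than 2016.0.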