Missing data fills, fill with the mean-CodePudding

I have a dataset with a column corresponding to categorical data, being A, B, C, D and E, all of these categories correspond to test scores, and some of these scores are NaN values. In this case I want to fill in each of these missing values by the average of the grades. This would be so much easier if I could just use fillna(), however categories are all about grades.

I really appreciate the help.

And so I wanted some way to populate these NaN values as they belong to a group.

CodePudding user response：

if you have something like this

import pandas as pd
import numpy as np
df = pd.DataFrame(
    [
        [1,'A'],
        [2,'B'],
        [3,'C'],
        [4,np.nan],
        [5,'A'],
        [6,'B'],
        [7,np.nan],
        [8,'B'],
        [9,'C'],
        [10,'D'],
    ], columns=['id','grade'])

and you df

    id  grade
0   1   A
1   2   B
2   3   C
3   4   NaN
4   5   A
5   6   B
6   7   NaN
7   8   B
8   9   C
9   10  D

if we are finding the most occurrent of the grade with

df.groupby('grade').size().to_frame()

you can see that the frequency should be

    0
grade   
A   2
B   3
C   2
D   1

You may use mode() to find out the value by

df_mode=df.grade.mode().values[0]
df_mode

then you can fill the missing value with

df.grade=df.grade.fillna(df_mode)
df

and the result should be like this

CodePudding user response：

If you are looking to replace the values with the mean value based on the grouped categorical grade you can do it a number of ways but this is a pretty simple one:

  Grade  Score
0    A      95
1    A    NaN
2    B    NaN
3    B      83
4    B      85
5    B      81
6    C      73
7    C    NaN
8    C      75

df.Score = df.groupby("Grade").transform(lambda x: x.fillna(x.mean()))

This groups by the categorical grade, iterates over the Score column and if it is NA drops in the mean for that category.

This is a very simply method.