Home > database >  Missing data fills, fill with the mean
Missing data fills, fill with the mean

Time:01-10

I have a dataset with a column corresponding to categorical data, being A, B, C, D and E, all of these categories correspond to test scores, and some of these scores are NaN values. In this case I want to fill in each of these missing values by the average of the grades. This would be so much easier if I could just use fillna(), however categories are all about grades.

I really appreciate the help.

And so I wanted some way to populate these NaN values as they belong to a group.

CodePudding user response:

if you have something like this

import pandas as pd
import numpy as np
df = pd.DataFrame(
    [
        [1,'A'],
        [2,'B'],
        [3,'C'],
        [4,np.nan],
        [5,'A'],
        [6,'B'],
        [7,np.nan],
        [8,'B'],
        [9,'C'],
        [10,'D'],
    ], columns=['id','grade'])

and you df

    id  grade
0   1   A
1   2   B
2   3   C
3   4   NaN
4   5   A
5   6   B
6   7   NaN
7   8   B
8   9   C
9   10  D

if we are finding the most occurrent of the grade with

df.groupby('grade').size().to_frame()

you can see that the frequency should be

    0
grade   
A   2
B   3
C   2
D   1

You may use mode() to find out the value by

df_mode=df.grade.mode().values[0]
df_mode

then you can fill the missing value with

df.grade=df.grade.fillna(df_mode)
df

and the result should be like this

    id  grade
0   1   A
1   2   B
2   3   C
3   4   B
4   5   A
5   6   B
6   7   B
7   8   B
8   9   C
9   10  D

CodePudding user response:

If you are looking to replace the values with the mean value based on the grouped categorical grade you can do it a number of ways but this is a pretty simple one:

  Grade  Score
0    A      95
1    A    NaN
2    B    NaN
3    B      83
4    B      85
5    B      81
6    C      73
7    C    NaN
8    C      75

df.Score = df.groupby("Grade").transform(lambda x: x.fillna(x.mean()))

This groups by the categorical grade, iterates over the Score column and if it is NA drops in the mean for that category.

This is a very simply method.

  • Related