I have a dataset with a column corresponding to categorical data, being A, B, C, D and E, all of these categories correspond to test scores, and some of these scores are NaN values. In this case I want to fill in each of these missing values by the average of the grades. This would be so much easier if I could just use fillna(), however categories are all about grades.
I really appreciate the help.
And so I wanted some way to populate these NaN values as they belong to a group.
CodePudding user response:
if you have something like this
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
[1,'A'],
[2,'B'],
[3,'C'],
[4,np.nan],
[5,'A'],
[6,'B'],
[7,np.nan],
[8,'B'],
[9,'C'],
[10,'D'],
], columns=['id','grade'])
and you df
id grade
0 1 A
1 2 B
2 3 C
3 4 NaN
4 5 A
5 6 B
6 7 NaN
7 8 B
8 9 C
9 10 D
if we are finding the most occurrent of the grade with
df.groupby('grade').size().to_frame()
you can see that the frequency should be
0
grade
A 2
B 3
C 2
D 1
You may use mode()
to find out the value by
df_mode=df.grade.mode().values[0]
df_mode
then you can fill the missing value with
df.grade=df.grade.fillna(df_mode)
df
and the result should be like this
id grade
0 1 A
1 2 B
2 3 C
3 4 B
4 5 A
5 6 B
6 7 B
7 8 B
8 9 C
9 10 D
CodePudding user response:
If you are looking to replace the values with the mean value based on the grouped categorical grade you can do it a number of ways but this is a pretty simple one:
Grade Score
0 A 95
1 A NaN
2 B NaN
3 B 83
4 B 85
5 B 81
6 C 73
7 C NaN
8 C 75
df.Score = df.groupby("Grade").transform(lambda x: x.fillna(x.mean()))
This groups by the categorical grade, iterates over the Score column and if it is NA drops in the mean for that category.
This is a very simply method.