How to fill NaN for categorical data randomly?

Time:10-25

I have a table like this one:

Sex SchGend
M Boys
F Girls
NaN Mixed
NaN Boys

I want to fill the NaN values in this table (there are around a hundred of them). SchGend tells whether the school is only for boys, only for girls, or for both. So to fill the 4th row I want to put M as the sex, but to fill the NaN for the mixed school I want to use a random value. I have no idea how to put a condition in the fillna method in pandas.

So that is my question: how can I do that? Any tips?

CodePudding user response:

First, fill in the values that the school type determines (Boys or Girls). Then fill the remaining ones randomly. You can use random.choices to generate a random sequence of "M" and "F" (numpy.random offers alternatives, such as numpy.random.choice, if you prefer).

If you run the code below several times, you may get different outcomes for the third record (the Mixed school).

from io import StringIO
import random
import pandas as pd

data = """
Sex SchGend
M   Boys
F   Girls
NaN Mixed
NaN Boys
"""

# the sample columns are separated by spaces, so split on any whitespace
x = pd.read_csv(StringIO(data), sep=r"\s+")

# fill missing cases for boys-only or girls-only schools
x.loc[x.Sex.isna() & (x.SchGend == "Boys"), "Sex"] = "M"
x.loc[x.Sex.isna() & (x.SchGend == "Girls"), "Sex"] = "F"

num_na = x.Sex.isna().sum()  # number of missing cases
x.loc[x.Sex.isna(), "Sex"] = random.choices(["M", "F"], k=num_na)
x
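
Since the question asks about putting a condition into fillna, here is a minimal sketch of one way that might work, together with the numpy.random alternative mentioned above. It assumes the same sample table; the names df, known, rng and missing, and the fixed seed, are illustrative choices and not part of the original answer.

```python
from io import StringIO

import numpy as np
import pandas as pd

data = """
Sex SchGend
M   Boys
F   Girls
NaN Mixed
NaN Boys
"""

df = pd.read_csv(StringIO(data), sep=r"\s+")

# Map the school type to a sex where it is determined; "Mixed" maps to NaN.
known = df["SchGend"].map({"Boys": "M", "Girls": "F"})

# fillna accepts a Series and aligns it on the index, so only missing
# values are replaced and existing ones are left untouched.
df["Sex"] = df["Sex"].fillna(known)

# Fill whatever is still missing with a random draw from numpy.
rng = np.random.default_rng(seed=0)  # seed is an assumption, only for repeatability
missing = df["Sex"].isna()
df.loc[missing, "Sex"] = rng.choice(["M", "F"], size=int(missing.sum()))

print(df)
```

Dropping the seed argument gives a different draw on each run. Passing a Series to fillna is what makes the fill conditional here: rows from Mixed schools stay NaN after the first step and are the only ones that reach the random step.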