A function that assigns a random value to an empty cell in a specific column

I have a dataset with information about clothes. One of its columns, 'color', has 15% missing values. I'm trying to write a function that assigns a 'random' color to the items whose color is missing, taking into account the probability of each color:

white     0.194729
black     0.149217
silver    0.121210
grey      0.097715
blue      0.086823
red       0.085831
green     0.027132
brown     0.023690
custom    0.022386
yellow    0.004960
orange    0.004493
purple    0.001984

for row in data[data['color'].isnull()]:
???????????????????

I'm completely lost.

CodePudding user response:

Is this basically what you are looking for?

import numpy as np
import pandas as pd

# df with colors and their probabilities
df_prob = pd.DataFrame({'color': ['red', 'blue', 'yellow'],
                        'prob':  [.3, .5, .2]})

# set a seed
np.random.seed(0)

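# create some dummy data; object dtype lets the column hold ints, NaN and strings
df = pd.DataFrame({'COL': np.random.randint(0, 10, size=10)}).astype(object)
# put some NaNs in the data (5 randomly chosen rows)
df.iloc[np.random.choice(df.index, size=5, replace=False)] = np.nan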

# actual solution:
# fill the gaps with random draws from df_prob
missing = df['COL'].isna()
df.loc[missing, 'COL'] = np.random.choice(df_prob['color'],
                                          size=missing.sum(),
                                          replace=True,
                                          p=df_prob['prob'])
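
To apply the same idea to the 'color' column from the question, the weights can be taken from the data itself instead of a hand-written table. Note that the frequencies listed in the question sum to roughly 0.82 rather than 1, so they would have to be divided by their total before being passed as p. The sketch below avoids that by using value_counts(normalize=True), which drops NaN by default. The function name fill_missing_by_frequency, the seed parameter and the small sample frame are made up for illustration, and the sketch assumes the column has at least one non-missing value.

```python
import numpy as np
import pandas as pd


def fill_missing_by_frequency(data, column, seed=None):
    """Return a copy of `data` in which NaNs in `column` are replaced by
    random draws weighted by the frequencies of the observed values."""
    rng = np.random.default_rng(seed)
    out = data.copy()
    # value_counts drops NaN by default, so the normalized weights sum to 1
    freqs = out[column].value_counts(normalize=True)
    missing = out[column].isna()
    out.loc[missing, column] = rng.choice(freqs.index.to_numpy(),
                                          size=missing.sum(),
                                          p=freqs.to_numpy())
    return out


# hypothetical stand-in for the dataset in the question
data = pd.DataFrame({'color': ['white', 'black', None, 'white', None, 'blue']})
filled = fill_missing_by_frequency(data, 'color', seed=0)
```

Returning a copy keeps the original frame intact, so the filled result can be compared against it. The draws are independent for each missing row, so with many rows the filled values should roughly follow the observed color distribution.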