I have a dataset with information about clothes. One of the columns is 'color', and 15% of its values are missing. I'm trying to write a function that assigns a 'random' color to the items whose color is missing, drawing each one according to the probability of it being a certain color:
white 0.194729
black 0.149217
silver 0.121210
grey 0.097715
blue 0.086823
red 0.085831
green 0.027132
brown 0.023690
custom 0.022386
yellow 0.004960
orange 0.004493
purple 0.001984
for row in data[data['color'].isnull()]:
???????????????????
I'm completely lost.
CodePudding user response:
Is this basically what you are looking for?
import numpy as np
import pandas as pd
# df with colors and their probabilities
df_prob = pd.DataFrame({'color': ['red', 'blue', 'yellow'],
                        'prob': [.3, .5, .2]})
# set a seed
np.random.seed(0)
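# create some dummy data (object dtype, so the column can also hold the color strings assigned below)
df = pd.DataFrame({'COL': np.random.randint(0, 10, size=10)}, dtype=object)
# set 5 randomly chosen rows to NaN
df.iloc[np.random.choice(df.index, size=5, replace=False)] = np.nan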
# actual solution:
# fill the gaps with random draw from df_prob
df.loc[df.COL.isna(), 'COL'] = np.random.choice(df_prob.color,
                                                size=df.COL.isna().sum(),
                                                replace=True,
                                                p=df_prob.prob)
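Applied to your case, the same idea could be wrapped in a function along these lines. This is only a sketch: the name fill_missing_colors, its seed argument and the toy data are made up for illustration, and it assumes you want the weights taken from the observed (non-missing) values of the column itself rather than from a hard-coded table.

```python
import numpy as np
import pandas as pd


def fill_missing_colors(data, column="color", seed=None):
    """Return a copy of `data` with NaNs in `column` replaced by random draws.

    Hypothetical helper: the draw weights are the relative frequencies
    of the observed (non-missing) values in the same column.
    """
    rng = np.random.default_rng(seed)
    out = data.copy()
    missing = out[column].isna()
    # value_counts(normalize=True) skips NaN by default, so the weights sum to 1
    probs = out[column].value_counts(normalize=True)
    out.loc[missing, column] = rng.choice(
        probs.index.to_numpy(),
        size=missing.sum(),
        replace=True,
        p=probs.to_numpy(),
    )
    return out


# toy data standing in for the real clothes dataset
data = pd.DataFrame({"color": ["white", "black", np.nan, "white", np.nan, "grey"]})
filled = fill_missing_colors(data, seed=0)
print(filled)
```

If you would rather pass in the probabilities listed in the question, note that they add up to roughly 0.82 rather than 1, so divide them by their sum first; np.random.choice raises a ValueError when p does not sum to 1.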