I have a dataset in which missing values are marked as " ?" in a single column (Feature1). I want to replace each missing value with one of the other values in that column. I started with this:
Feature1_value_counts = df.Feature1.value_counts(normalize=True)
The code above gives me the relative frequency of each value, which I could use as frac in pandas. Feature1 contains 15 unique values, so I get 15 numbers (all fractions between 0 and 1).
Now I need to randomly replace each " ?" with one of those unique values (all strings), using those fractions as probabilities.
I don't know how to do this with pandas.
I've tried loc and iloc, as well as some for loops and if statements, but I couldn't get it to work.
CodePudding user response:
You can take advantage of the p parameter of numpy.random.choice:
import numpy as np
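# ensure real NaNs are used for missing values
# (the marker in the question is ' ?' with a leading space, so allow surrounding whitespace)
df['Feature1'] = df['Feature1'].replace(r'^\s*\?\s*$', np.nan, regex=True)
# compute the fraction of each non-NaN value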
counts = df['Feature1'].value_counts(normalize=True)
# identify the rows with NaNs
m = df['Feature1'].isna()
# replace the NaNs with random values, using the frequencies as weights
df.loc[m, 'Feature1'] = np.random.choice(counts.index, p=counts, size=m.sum())
print(df)
Output (replaced values as uppercase for clarity):
Feature1
0 a
1 b
2 a
3 A
4 a
5 b
6 B
7 a
8 A
Used input:
import pandas as pd
df = pd.DataFrame({'Feature1': ['a', 'b', 'a', np.nan, 'a', 'b', np.nan, 'a', np.nan]})
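For reference, here is the same idea as a self-contained sketch that starts from the " ?" markers described in the question. The function name fill_missing_by_frequency and its missing_token and seed parameters are hypothetical, added only for illustration. It assumes the column holds strings, and it uses NumPy's Generator API instead of np.random.choice so that the draws can be made reproducible.

```python
import numpy as np
import pandas as pd


def fill_missing_by_frequency(series, missing_token="?", seed=None):
    """Fill placeholder entries with values drawn at their observed frequencies.

    Hypothetical helper for illustration; assumes a column of strings.
    """
    rng = np.random.default_rng(seed)
    # Strip whitespace so ' ?' and '?' are treated alike, then blank out
    # the placeholder (mask() puts NaN wherever the condition is True).
    stripped = series.str.strip()
    cleaned = stripped.mask(stripped == missing_token)
    # Relative frequency of each observed (non-NaN) value.
    freqs = cleaned.value_counts(normalize=True)
    missing = cleaned.isna()
    # Draw one replacement per missing row, weighted by frequency.
    draws = rng.choice(
        freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy()
    )
    cleaned.loc[missing] = draws
    return cleaned


df = pd.DataFrame({"Feature1": ["a", "b", "a", " ?", "a", "b", " ?", "a", " ?"]})
df["Feature1"] = fill_missing_by_frequency(df["Feature1"], seed=0)
print(df)
```

Passing a fixed seed only makes the result repeatable between runs. The filled values are still random draws, so a different seed will generally give a different fill.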