I'm trying to create a new categorical column of countries with specific percentage values. Take the following dataset, for instance:
df = sns.load_dataset("titanic")
I'm trying the following script to get the new column:
country = ['UK', 'Ireland', 'France']
df["country"] = np.random.choice(country, len(df))
df["country"].value_counts(normalize=True)
UK 0.344557
Ireland 0.328844
France 0.326599
However, every country ends up with roughly the same share of rows. I want a specific proportion for each country:
Desired Output
df["country"].value_counts(normalize=True)
UK 0.91
Ireland 0.06
France 0.03
What would be the ideal way of getting the desired output? Any suggestions would be appreciated. Thanks!
Answer:
You can change the probabilities used by numpy.random.choice with its p parameter. The draws are random, so the resulting proportions only approximate the targets:
df["country"] = np.random.choice(country, len(df), p=[0.91, 0.06, 0.03])
df["country"].value_counts(normalize=True)
Output:
UK 0.902357
Ireland 0.058361
France 0.039282
Name: country, dtype: float64
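If the result needs to be reproducible, one option is to draw from a seeded generator instead of the global np.random state. The following is a minimal sketch, not part of the original answer: the seed value is arbitrary, and n_rows = 891 stands in for len(df) on the titanic dataset.

```python
import numpy as np

countries = ["UK", "Ireland", "France"]
probs = [0.91, 0.06, 0.03]  # must sum to 1

# Seed chosen arbitrarily for illustration; any fixed value gives repeatable draws.
rng = np.random.default_rng(seed=0)

n_rows = 891  # assumption: stands in for len(df) on the titanic dataset
sample = rng.choice(countries, size=n_rows, p=probs)

# Observed share of each country (approximates probs, does not match exactly).
values, counts = np.unique(sample, return_counts=True)
shares = dict(zip(values.tolist(), (counts / n_rows).tolist()))
print(shares)
```

With a DataFrame at hand, the same array can be assigned directly with df["country"] = sample, provided size is set to len(df).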
If you want an exact number of values per country (up to rounding of the counts):
p = [0.91, 0.06, 0.03]
r = (np.array(p)*len(df)).round().astype(int) # the sum MUST be equal to len(df)
# or
# r = [811, 53, 27]
a = np.repeat(country, r)
np.random.shuffle(a)
df['country'] = a
df["country"].value_counts(normalize=True)
Output:
UK 0.910213
Ireland 0.059484
France 0.030303
Name: country, dtype: float64
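Plain rounding does not always produce counts that sum to len(df). For example, three equal shares over 10 rows round to 3 + 3 + 3 = 9, and the assignment then fails with a length mismatch. One possible way around this is sketched below: floor the counts, then hand the leftover rows to the categories with the largest fractional parts. The helper exact_counts is a hypothetical name, it assumes p sums to 1, and n_rows = 891 again stands in for len(df).

```python
import numpy as np

def exact_counts(p, n):
    """Return integer counts close to p * n that sum to exactly n.

    Hypothetical helper for illustration; assumes p sums to 1.
    """
    raw = np.asarray(p, dtype=float) * n
    counts = np.floor(raw).astype(int)
    shortfall = n - counts.sum()
    # Give the leftover rows to the categories with the largest fractional parts.
    order = np.argsort(raw - counts)[::-1]
    counts[order[:shortfall]] += 1
    return counts

countries = ["UK", "Ireland", "France"]
n_rows = 891  # assumption: stands in for len(df) on the titanic dataset

counts = exact_counts([0.91, 0.06, 0.03], n_rows)
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
labels = rng.permutation(np.repeat(countries, counts))
print(counts, len(labels))
```

For these inputs the counts come out the same as with round(), but the sum is guaranteed to equal n_rows for other proportions as well, so df["country"] = labels would not hit a length error.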