Add Categorical Column with Specific Count

Time:11-23

I'm trying to create a new categorical column of countries with specific percentage values. Take the following dataset, for instance:

import numpy as np
import seaborn as sns

df = sns.load_dataset("titanic")

I'm trying the following script to get the new column:

country = ['UK', 'Ireland', 'France']

df["country"] = np.random.choice(country, len(df))

df["country"].value_counts(normalize=True)

UK         0.344557
Ireland    0.328844
France     0.326599

However, all the countries come out with roughly equal proportions. I want a specific proportion for each country:

Desired Output

df["country"].value_counts(normalize=True)

UK         0.91
Ireland    0.06
France     0.03

What would be the ideal way of getting the desired output? Any suggestions would be appreciated. Thanks!

CodePudding user response:

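Do you want to change the probabilities of numpy.random.choice? Without p it samples every value with equal probability; pass the desired probabilities through the p parameter: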

df["country"] = np.random.choice(country, len(df), p=[0.91, 0.06, 0.03])
df["country"].value_counts(normalize=True)

Output:

UK         0.902357
Ireland    0.058361
France     0.039282
Name: country, dtype: float64
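
Because the values are drawn at random, the proportions above only approximate p and change on every run. A minimal sketch of a reproducible variant, assuming NumPy's Generator API (np.random.default_rng) is acceptable in place of np.random.choice; the seed value and the n_rows variable are illustrative choices, not part of the original answer:

```python
import numpy as np
import pandas as pd

country = ["UK", "Ireland", "France"]
p = [0.91, 0.06, 0.03]

# Hypothetical stand-in for len(df); the titanic dataset has 891 rows.
n_rows = 891

# A seeded generator makes the random draw repeatable (seed chosen arbitrarily).
rng = np.random.default_rng(0)

# Each row is drawn independently, so the shares are close to p but not exact.
sample = pd.Series(rng.choice(country, size=n_rows, p=p), name="country")
shares = sample.value_counts(normalize=True)
print(shares)
```

With a real DataFrame, assigning df["country"] = rng.choice(country, size=len(df), p=p) should behave the same way.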

If you want an exact number of values (within the limit of the precision):

p = [0.91, 0.06, 0.03]
r = (np.array(p)*len(df)).round().astype(int) # the sum MUST be equal to len(df)
# or
# r = [811,  53,  27]

a = np.repeat(country, r)
np.random.shuffle(a)

df['country'] = a

df["country"].value_counts(normalize=True)

Output:

UK         0.910213
Ireland    0.059484
France     0.030303
Name: country, dtype: float64
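
The rounding step above works for 891 rows, but plain rounding does not always produce counts that sum to len(df) (for example, p = [0.5, 0.25, 0.25] with 10 rows rounds to 5, 2, 2). A hedged sketch of one way around that, using largest-remainder allocation; the helper name exact_counts is made up for illustration, and it assumes p sums to 1:

```python
import numpy as np

def exact_counts(p, n):
    """Hypothetical helper: integer counts close to p * n that sum to n.

    Assumes the probabilities in p sum to 1.
    """
    raw = np.asarray(p, dtype=float) * n
    counts = np.floor(raw).astype(int)
    # Rows still unassigned after rounding everything down.
    shortfall = int(n - counts.sum())
    # Hand the leftover rows to the entries with the largest fractional parts.
    order = np.argsort(-(raw - counts), kind="stable")
    counts[order[:shortfall]] += 1
    return counts

country = ["UK", "Ireland", "France"]
p = [0.91, 0.06, 0.03]
n_rows = 891  # stand-in for len(df)

r = exact_counts(p, n_rows)

# Repeat each label by its count, then shuffle so the order is random.
rng = np.random.default_rng(0)  # seed chosen arbitrarily
a = np.repeat(country, r)
rng.shuffle(a)
```

For 891 rows this yields the same counts as the answer (811, 53, 27); the array a can then be assigned with df["country"] = a.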