Generate random cities based on probabilities-CodePudding

I have two tables, one with country, city and the probability a user can be from that city, and another table with users and their countries:

country	city	probability
USA	New York	0.3
USA	Houston	0.5
USA	LA	0.2

user_id	country
1	USA
2	USA
3	USA
4	USA
5	USA
6	USA

In the second table, I want to create a new column (city) that randomly chooses a city from the table above based on the probability; so Houston should be chosen more often.

Expected output (random):

user_id	country	city
1	USA	Houston
2	USA	Houston
3	USA	LA
4	USA	Houston
5	USA	New York
6	USA	New York

I'm using pandas for this. Obviously there will be more countries and cities, and more users in different countries.

CodePudding user response：

You can use np.random.choice to generate a random sample according to a given probability distribution:


In [27]: df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6], "country": ["USA"]*6})

In [28]: df
Out[28]:
   user_id country
0        1     USA
1        2     USA
2        3     USA
3        4     USA
4        5     USA
5        6     USA

In [32]: df["city"] = np.random.choice(["New York", "Houston", "LA"], size=len(df), p=[0.3, 0.5, 0.2])

In [33]: df
Out[33]:
   user_id country      city
0        1     USA   Houston
1        2     USA   Houston
2        3     USA   Houston
3        4     USA        LA
4        5     USA   Houston
5        6     USA  New York

If users are in different countries you will need to do this once per country.

Edit: Solution which handles duplicates in the user_id column consistently:

In [34]: df = pd.DataFrame({"user_id": [1, 2, 3, 4, 4, 1], "country": ["USA"]*6})

In [35]: df
Out[35]:
   user_id country
0        1     USA
1        2     USA
2        3     USA
3        4     USA
4        4     USA
5        1     USA

In [36]: num_users = df["user_id"].nunique()

In [37]: num_users
Out[37]: 4

In [38]: cities = np.random.choice(["New York", "Houston", "LA"], size=num_users, p=[0.3, 0.5, 0.2])

In [39]: cities
Out[39]: array(['Houston', 'New York', 'LA', 'New York'], dtype='<U8')

In [40]: df["user_id"].unique()
Out[40]: array([1, 2, 3, 4], dtype=int64)

In [41]: mapping = {user_id: city for user_id, city in zip(df["user_id"], cities)}

In [43]: df["city"] = df["user_id"].apply(mapping.get)

In [44]: df
Out[44]:
   user_id country      city
0        1     USA   Houston
1        2     USA  New York
2        3     USA        LA
3        4     USA  New York
4        4     USA  New York
5        1     USA   Houston