Home > Mobile >  Generate random cities based on probabilities
Generate random cities based on probabilities

Time:11-18

I have two tables, one with country, city and the probability a user can be from that city, and another table with users and their countries:

country city probability
USA New York 0.3
USA Houston 0.5
USA LA 0.2
user_id country
1 USA
2 USA
3 USA
4 USA
5 USA
6 USA

In the second table, I want to create a new column (city) that randomly chooses a city from the table above based on the probability; so Houston should be chosen more often.

Expected output (random):

user_id country city
1 USA Houston
2 USA Houston
3 USA LA
4 USA Houston
5 USA New York
6 USA New York

I'm using pandas for this. Obviously there will be more countries and cities, and more users in different countries.

CodePudding user response:

You can use np.random.choice to generate a random sample according to a given probability distribution:


In [27]: df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6], "country": ["USA"]*6})

In [28]: df
Out[28]:
   user_id country
0        1     USA
1        2     USA
2        3     USA
3        4     USA
4        5     USA
5        6     USA

In [32]: df["city"] = np.random.choice(["New York", "Houston", "LA"], size=len(df), p=[0.3, 0.5, 0.2])

In [33]: df
Out[33]:
   user_id country      city
0        1     USA   Houston
1        2     USA   Houston
2        3     USA   Houston
3        4     USA        LA
4        5     USA   Houston
5        6     USA  New York

If users are in different countries you will need to do this once per country.


Edit: Solution which handles duplicates in the user_id column consistently:

In [34]: df = pd.DataFrame({"user_id": [1, 2, 3, 4, 4, 1], "country": ["USA"]*6})

In [35]: df
Out[35]:
   user_id country
0        1     USA
1        2     USA
2        3     USA
3        4     USA
4        4     USA
5        1     USA

In [36]: num_users = df["user_id"].nunique()

In [37]: num_users
Out[37]: 4

In [38]: cities = np.random.choice(["New York", "Houston", "LA"], size=num_users, p=[0.3, 0.5, 0.2])

In [39]: cities
Out[39]: array(['Houston', 'New York', 'LA', 'New York'], dtype='<U8')

In [40]: df["user_id"].unique()
Out[40]: array([1, 2, 3, 4], dtype=int64)

In [41]: mapping = {user_id: city for user_id, city in zip(df["user_id"], cities)}

In [43]: df["city"] = df["user_id"].apply(mapping.get)

In [44]: df
Out[44]:
   user_id country      city
0        1     USA   Houston
1        2     USA  New York
2        3     USA        LA
3        4     USA  New York
4        4     USA  New York
5        1     USA   Houston
  • Related