I have two tables, one with country, city and the probability a user can be from that city, and another table with users and their countries:
country | city | probability |
---|---|---|
USA | New York | 0.3 |
USA | Houston | 0.5 |
USA | LA | 0.2 |
user_id | country |
---|---|
1 | USA |
2 | USA |
3 | USA |
4 | USA |
5 | USA |
6 | USA |
In the second table, I want to create a new column (city) that randomly chooses a city from the table above based on the probability; so Houston should be chosen more often.
Expected output (random):
user_id | country | city |
---|---|---|
1 | USA | Houston |
2 | USA | Houston |
3 | USA | LA |
4 | USA | Houston |
5 | USA | New York |
6 | USA | New York |
I'm using pandas for this. Obviously there will be more countries and cities, and more users in different countries.
CodePudding user response:
You can use np.random.choice
to generate a random sample according to a given probability distribution:
In [27]: df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6], "country": ["USA"]*6})
In [28]: df
Out[28]:
user_id country
0 1 USA
1 2 USA
2 3 USA
3 4 USA
4 5 USA
5 6 USA
In [32]: df["city"] = np.random.choice(["New York", "Houston", "LA"], size=len(df), p=[0.3, 0.5, 0.2])
In [33]: df
Out[33]:
user_id country city
0 1 USA Houston
1 2 USA Houston
2 3 USA Houston
3 4 USA LA
4 5 USA Houston
5 6 USA New York
If users are in different countries you will need to do this once per country.
Edit: Solution which handles duplicates in the user_id
column consistently:
In [34]: df = pd.DataFrame({"user_id": [1, 2, 3, 4, 4, 1], "country": ["USA"]*6})
In [35]: df
Out[35]:
user_id country
0 1 USA
1 2 USA
2 3 USA
3 4 USA
4 4 USA
5 1 USA
In [36]: num_users = df["user_id"].nunique()
In [37]: num_users
Out[37]: 4
In [38]: cities = np.random.choice(["New York", "Houston", "LA"], size=num_users, p=[0.3, 0.5, 0.2])
In [39]: cities
Out[39]: array(['Houston', 'New York', 'LA', 'New York'], dtype='<U8')
In [40]: df["user_id"].unique()
Out[40]: array([1, 2, 3, 4], dtype=int64)
In [41]: mapping = {user_id: city for user_id, city in zip(df["user_id"], cities)}
In [43]: df["city"] = df["user_id"].apply(mapping.get)
In [44]: df
Out[44]:
user_id country city
0 1 USA Houston
1 2 USA New York
2 3 USA LA
3 4 USA New York
4 4 USA New York
5 1 USA Houston