Context
I have a dataframe in which I need to remap a column to different values. For some values the mapping is ambiguous: the resulting value should be chosen at random from a list every time the value to be mapped is encountered.
For example, the values in the column should be remapped as follows:
- 1 ➝ 'a'
- 2 ➝ 'b' or 'c', chosen at random
- 3 ➝ 'd'
If there are two rows with a 2, a random draw should be done each time to determine whether the value is mapped to b or to c.
Example data
Here is some example data:
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6, 7, 8], "col2": [2, 2, 2, 3, 1, 2, 2, 1]})
What I've looked into
I've tried using map and a random.choice call with a mapping dictionary (as described in this answer):
import random
choice_list = ["b", "c"]
map_dict = {1: "a", 2: random.choice(choice_list), 3: "d"}
df["remap"] = df.col2.map(map_dict)
I found that when remapping the value 2, a single value from choice_list was always chosen for all rows, e.g. all b's:
col1 col2 remap
0 1 2 b
1 2 2 b
2 3 2 b
3 4 3 d
4 5 1 a
5 6 2 b
6 7 2 b
7 8 1 a
Something similar happens when I use the replace method.
My expected outcome would be something like:
col1 col2 remap
0 1 2 b
1 2 2 c
2 3 2 b
3 4 3 d
4 5 1 a
5 6 2 b
6 7 2 c
7 8 1 a
Answer:
what is wrong
In the following line, random.choice is called only once, when the dictionary is created, so the replacement for 2 is fixed for every row, which is not what you want:
map_dict = {1: "a", 2: random.choice(choice_list), 3: "d"}
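A quick way to see this, as a minimal sketch: a dict literal evaluates its values when it is built, so the entry for 2 ends up holding a plain string rather than something that can be redrawn. The variable names stored and lookups are only for illustration.

```python
import random

choice_list = ["b", "c"]

# random.choice runs once, here, while the dict literal is being built.
map_dict = {1: "a", 2: random.choice(choice_list), 3: "d"}

# The entry for 2 is now an ordinary string, either "b" or "c".
stored = map_dict[2]
print(type(stored).__name__, stored)

# Looking the key up again never triggers a new draw.
lookups = {map_dict[2] for _ in range(1000)}
print(lookups)  # a set with exactly one element
```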
how to fix it
You need to make a random choice every time a value is mapped.
To do this, store a list of candidates for each key in map_dict and use a small wrapper:
import random
map_dict = {1: ["a"], 2: ["b", "c"], 3: ["d"]}
df["remap"] = df.col2.map(lambda x: random.choice(map_dict[x]))
possible output:
col1 col2 remap
0 1 2 c
1 2 2 b
2 3 2 c
3 4 3 d
4 5 1 a
5 6 2 b
6 7 2 c
7 8 1 a
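One caveat with the lambda: map_dict[x] raises a KeyError for any value that has no entry, whereas mapping with a plain dict leaves such values missing. A minimal sketch of a more defensive wrapper follows; the helper name draw, the default of None, and the extra value 4 in the data are illustrative choices, not part of the answer above.

```python
import random

import pandas as pd

map_dict = {1: ["a"], 2: ["b", "c"], 3: ["d"]}


def draw(value, default=None):
    """Return a random replacement for value, or default if it is unmapped."""
    options = map_dict.get(value)
    if not options:
        return default
    # A new draw happens on every call, i.e. once per row.
    return random.choice(options)


# Hypothetical data: 4 has no entry in map_dict.
df = pd.DataFrame({"col2": [2, 2, 3, 1, 4]})
df["remap"] = df["col2"].map(draw)
print(df)
```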
faster alternative for large datasets
If you have many rows (tens of thousands or more), this alternative should be faster, since it avoids calling a Python function for every row:
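map_dict = {1: ["a"], 2: ["b", "c"], 3: ["d"]}
# One entry per (value, candidate) pair, indexed by the value to be mapped
map_s = pd.Series(map_dict, name='remap').explode()
# Attach every candidate to each row, then keep one random candidate
# per group of rows sharing the same index label
out = (df.merge(map_s, left_on='col2', right_index=True)
         .groupby(level=0).sample(n=1)
      )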
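Another vectorized option, sketched here under the assumption that NumPy is available, is to draw all replacements for one source value in a single call. This is not the approach from the answer above; the seed and the loop over map_dict are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6, 7, 8], "col2": [2, 2, 2, 3, 1, 2, 2, 1]})
map_dict = {1: ["a"], 2: ["b", "c"], 3: ["d"]}

rng = np.random.default_rng(0)  # seed chosen only to make runs repeatable

# Start with None everywhere; values without an entry in map_dict stay None.
remap = np.full(len(df), None, dtype=object)
for value, options in map_dict.items():
    mask = df["col2"].eq(value).to_numpy()
    # One call draws an independent replacement for every matching row.
    remap[mask] = rng.choice(options, size=int(mask.sum()))

df["remap"] = remap
print(df)
```

The loop runs once per distinct key rather than once per row, so the Python-level overhead stays small when there are few keys and many rows.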