How can I remap values in a pandas column using a random draw from a list?

Context

I have a dataframe where I need to remap a column to different values. For some values the mapping is ambiguous: the resulting value should be chosen randomly from a list every time the value to be mapped is encountered.

For example, the values in the column should be remapped in the following way:

1 ➝ 'a'
2 ➝ 'b' or 'c', chosen at random
3 ➝ 'd'

If there are two rows with a 2, a random draw should be done each time to determine if the value should be mapped to b or to c.

Example data

Here is some example data:

import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6, 7, 8], "col2": [2, 2, 2, 3, 1, 2, 2, 1]})

What I've looked into

I've tried using map and a random.choice call with a mapping dictionary (as described in this answer):

import random

choice_list = ["b", "c"]
map_dict = {1: "a", 2: random.choice(choice_list), 3: "d"}
df["remap"] = df.col2.map(map_dict)

I found that when remapping the value 2, a single value from choice_list was picked once and then used for all rows, e.g. all b's:

   col1  col2 remap
0     1     2     b
1     2     2     b
2     3     2     b
3     4     3     d
4     5     1     a
5     6     2     b
6     7     2     b
7     8     1     a

Something similar happens when I use the replace method.

My expected outcome would be something like:

   col1  col2 remap
0     1     2     b
1     2     2     c
2     3     2     b
3     4     3     d
4     5     1     a
5     6     2     b
6     7     2     c
7     8     1     a

Answer:

what is wrong

In the following line, random.choice is called only once, when the dictionary is created. The replacement value for 2 is therefore fixed for every row, which is not what you want.

map_dict = {1: "a", 2: random.choice(choice_list), 3: "d"}
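
To make the point above concrete, here is a minimal sketch using only the standard library. The `lookups` list is a name introduced purely for this illustration. It shows that the dictionary stores an ordinary string for key 2, so repeated lookups can never differ.

```python
import random

choice_list = ["b", "c"]

# random.choice runs once, while the dict literal is being built,
# so the dict stores a plain string for key 2, not a fresh draw per lookup.
map_dict = {1: "a", 2: random.choice(choice_list), 3: "d"}

# Every lookup of key 2 returns that same stored string.
lookups = [map_dict[2] for _ in range(5)]
print(lookups)  # five identical entries, either all "b" or all "c"
```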

how to fix it

You need to make the random choice every time you map a value.

For this, change the map_dict format so that every key maps to a list of candidates, and use a small wrapper:

import random
map_dict = {1: ["a"], 2: ["b", "c"], 3: ["d"]}
df["remap"] = df.col2.map(lambda x: random.choice(map_dict[x]))

possible output:

   col1  col2 remap
0     1     2     c
1     2     2     b
2     3     2     c
3     4     3     d
4     5     1     a
5     6     2     b
6     7     2     c
7     8     1     a
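
Because the draws are random, the output changes from run to run. If you need repeatable results (for tests, say), one option is to draw from a seeded generator instead of the module-level functions. The sketch below is an assumption about how you might package that: `remap_random` and its `seed` parameter are hypothetical names, not part of pandas.

```python
import random

import pandas as pd


def remap_random(series, mapping, seed=None):
    """Hypothetical helper: map each value to a random pick from mapping[value]."""
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return series.map(lambda x: rng.choice(mapping[x]))


df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6, 7, 8], "col2": [2, 2, 2, 3, 1, 2, 2, 1]})
map_dict = {1: ["a"], 2: ["b", "c"], 3: ["d"]}

# The seed value is arbitrary; the same seed reproduces the same draws.
df["remap"] = remap_random(df["col2"], map_dict, seed=0)
print(df)
```

As with the lambda version, a value in col2 that has no key in map_dict raises a KeyError, which may be what you want if an unmapped value indicates bad data.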

faster alternative for large datasets

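If you have many rows (tens of thousands), this alternative should be faster, because it replaces the per-row Python call with a merge followed by a grouped sample:

# one entry per (key, candidate) pair, indexed by key
map_dict = {1: ["a"], 2: ["b", "c"], 3: ["d"]}
map_s = pd.Series(map_dict, name='remap').explode()

# attach every candidate to each row, then keep one random candidate per original row
(df.merge(map_s, left_on='col2', right_index=True)
   .groupby(level=0).sample(1)
)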
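
The expression above returns a new dataframe rather than modifying df. Below is a self-contained sketch of one way to write the sampled values back into the original frame. The intermediate names `candidates` and `picked` are introduced here for illustration. It relies on the merge keeping the left frame's index, which is what `groupby(level=0)` groups on.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6, 7, 8], "col2": [2, 2, 2, 3, 1, 2, 2, 1]})
map_dict = {1: ["a"], 2: ["b", "c"], 3: ["d"]}

# One entry per (key, candidate) pair, indexed by key.
map_s = pd.Series(map_dict, name="remap").explode()

# Each row of df is repeated once per candidate; the original row
# labels are kept as the index of the merged frame.
candidates = df.merge(map_s, left_on="col2", right_index=True)

# Keep exactly one randomly chosen candidate per original row.
picked = candidates.groupby(level=0).sample(n=1)

# Assignment aligns on the index, so each row gets its own draw.
df["remap"] = picked["remap"]
print(df)
```

Note that the default inner merge drops rows whose col2 value has no key in map_dict, so after the assignment those rows would hold NaN instead of raising an error as the lambda version does.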