How to replace values in a pandas column with random numbers per unique value (random categorical)?

I have a df with a column that looks like this:

This column contains sensitive data. I want to replace each value with a random number, but every occurrence of the same ID must get the same random number.

For example, I want to mask the data in the column like so:

Note the same IDs have the same value. How do I achieve this? I have thousands of IDs.
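
To make the requirement concrete, here is a minimal sketch of a check that any masking approach could be run through. The helper name `check_mask` and the sample values are hypothetical; the sample IDs are the ones that appear in the answers' output in this thread. The question only states that equal IDs must share a value; whether different IDs must also get different values is not stated, so the sketch reports the two properties separately.

```python
import pandas as pd


def check_mask(original: pd.Series, masked: pd.Series) -> dict:
    """Report whether a masked column respects the original IDs (hypothetical helper)."""
    # Keep one row per distinct (original, masked) pair.
    pairs = pd.DataFrame(
        {"orig": original.to_numpy(), "mask": masked.to_numpy()}
    ).drop_duplicates()
    return {
        # Every original ID maps to exactly one masked value.
        "consistent": bool(pairs["orig"].is_unique),
        # No two original IDs share a masked value.
        "collision_free": bool(pairs["mask"].is_unique),
    }


ids = pd.Series([11, 22, 22, 333, 33, 333])
good = pd.Series([5, 7, 7, 9, 2, 9])   # made-up masked values
print(check_mask(ids, good))
```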

CodePudding user response：

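I would suggest something like this:

from random import randint

# One random draw per group; transform broadcasts it to every row of that group.
# Note: randint can return the same number for two different IDs.
df['id_rand'] = df.groupby('id')['id'].transform(lambda x: randint(1, 1000))

>>> df
    id  id_rand
0   11      833
1   22      577
2   22      577
3  333      101
4   33      723
5  333      101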
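
Because `randint` is called independently for each group, two different IDs can end up with the same number. If that matters, one possible variation (a sketch, not the answerer's code) is to draw all labels at once without replacement and assign them by group number. The upper bound used here is an arbitrary assumption, and the sample IDs are copied from the output above.

```python
import random

import pandas as pd

# Sample IDs copied from the output above.
df = pd.DataFrame({"id": [11, 22, 22, 333, 33, 333]})

# ngroup() numbers the groups 0..n_groups-1 (sorted key order by default).
group_no = df.groupby("id").ngroup()
n_groups = df["id"].nunique()

# random.sample draws without replacement, so all labels are distinct.
# Assumption: a pool of 10x the number of groups (at least 1000) is large enough.
upper = max(1000, 10 * n_groups)
labels = random.sample(range(1, upper + 1), k=n_groups)

# Look up each row's label by its group number.
df["id_rand"] = group_no.map(lambda g: labels[g])
print(df)
```

This assumes the `id` column has no missing values, since `groupby` drops NaN keys by default.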

CodePudding user response：

Here are two options: generate a categorical code (non-random, id2), or a unique random number per original ID (id3). In both cases we can use pandas.factorize (or, alternatively, unique or pandas.Categorical).

import numpy as np
import pandas as pd

# enumerated categorical: codes 0, 1, 2, ... in order of first appearance
df['id2'] = pd.factorize(df['id'])[0]

# random categorical: one distinct random number per unique ID
# (with range(1000) and replace=False this needs at most 1000 unique IDs)
codes, ids = pd.factorize(df['id'])
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 1
ids = df['id'].unique()
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 2 (id3 ends up with categorical dtype)
df['id3'] = pd.Categorical(df['id'])
new_ids = np.random.choice(range(1000), size=len(df['id3'].cat.categories), replace=False)
df['id3'] = df['id3'].cat.rename_categories(new_ids)

Output:

    id  id2  id3
0   11    0  395
1   22    1  428
2   22    1  428
3  333    2  528
4   33    3  783
5  333    2  528
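
Since the question mentions thousands of IDs, note that `np.random.choice(range(1000), ..., replace=False)` raises a `ValueError` once there are more than 1000 unique IDs. A minimal sketch of the factorize approach wrapped in a function that sizes the pool from the data follows. The name `mask_ids`, the 10x pool factor, and the `seed` parameter are assumptions added for illustration, and the sketch assumes the column has no missing values.

```python
import numpy as np
import pandas as pd


def mask_ids(ids: pd.Series, seed=None) -> pd.Series:
    """Replace each distinct ID with a distinct random integer (hypothetical helper)."""
    # codes[i] is the position of ids[i] among the unique values.
    codes, uniques = pd.factorize(ids)
    n = len(uniques)
    rng = np.random.default_rng(seed)
    # Assumption: a pool 10x the number of unique IDs (at least 1000),
    # so sampling without replacement never runs short.
    pool = max(1000, 10 * n)
    labels = rng.choice(pool, size=n, replace=False)
    # Index the labels by code to broadcast them back to every row.
    return pd.Series(labels[codes], index=ids.index, name="id_masked")


# 5000 distinct IDs, each appearing twice.
df = pd.DataFrame({"id": np.repeat(np.arange(5000), 2)})
df["id_masked"] = mask_ids(df["id"], seed=0)
```

Passing a fixed `seed` makes the mapping reproducible between runs; leaving it as `None` gives a fresh mapping each time.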