How to replace values in Pandas column with random numbers per unique values (random categorical)?

Time:08-29

I have a df with a column that looks like this:

id   
11    
22
22
333
33
333

This column contains sensitive data. I want to replace each value with a random number, but identical IDs must always map to the same random number.

For example, I want to mask the data in the column like so:

id   
123   
987
987
456
00
456

Note that identical IDs have the same value. How do I achieve this? I have thousands of IDs.

CodePudding user response:

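I would suggest something like this:

from random import randint

# Draw one random integer per group of identical IDs and broadcast it to every row of that group.
# Note: each group draws independently, so two different IDs can end up with the same number.
df['id_rand'] = df.groupby('id')['id'].transform(lambda x: randint(1, 1000))

>>> df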
    id  id_rand
0   11      833
1   22      577
2   22      577
3  333      101
4   33      723
5  333      101
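
Since every group draws its number independently here, it can be worth checking afterwards that the masking is one-to-one (no two IDs share a masked value). A minimal sketch of such a check follows; the helper name `is_one_to_one` and the two small example frames are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

def is_one_to_one(df, original, masked):
    """Return True if each original value has exactly one masked value and vice versa."""
    # Hypothetical helper: reduce the frame to its distinct (original, masked) pairs
    pairs = df[[original, masked]].drop_duplicates()
    # One-to-one means neither column repeats among the distinct pairs
    return pairs[original].is_unique and pairs[masked].is_unique

# Example data (made up for illustration)
ok = pd.DataFrame({"id": [11, 22, 22, 333], "id_rand": [833, 577, 577, 101]})
collision = pd.DataFrame({"id": [11, 22, 22, 333], "id_rand": [833, 577, 577, 833]})

print(is_one_to_one(ok, "id", "id_rand"))         # True
print(is_one_to_one(collision, "id", "id_rand"))  # False
```

If the check fails, redrawing or sampling without replacement avoids the collision.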

CodePudding user response:

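Here are two options: generate an enumerated (non-random) categorical code (id2), or a unique random number per original ID (id3). In both cases we can use pandas.factorize (or alternatively unique, or pandas.Categorical).

import numpy as np
import pandas as pd

# enumerated categorical: codes 0, 1, 2, ... in order of first appearance
df['id2'] = pd.factorize(df['id'])[0]

# random categorical: one distinct number per unique ID (replace=False prevents collisions)
# note: the pool (here range(1000)) must hold at least as many values as there are unique IDs
s, ids = pd.factorize(df['id'])  # s holds the integer codes, ids the unique values
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)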

# alternative 1
ids = df['id'].unique()
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 2
df['id3'] = pd.Categorical(df['id'])
new_ids = np.random.choice(range(1000), size=len(df['id3'].cat.categories), replace=False)
df['id3'] = df['id3'].cat.rename_categories(new_ids)

Output:

    id  id2  id3
0   11    0  395
1   22    1  428
2   22    1  428
3  333    2  528
4   33    3  783
5  333    2  528