I have a df with a column that looks like this:
id
11
22
22
333
33
333
This column contains sensitive data. I want to replace each value with a random number, but identical IDs must always map to the same random number.
For example, I want to mask the data in the column like so:
id
123
987
987
456
00
456
Note the same IDs have the same value. How do I achieve this? I have thousands of IDs.
CodePudding user response:
I would suggest something like this:
from random import randint
df['id_rand'] = df.groupby('id')['id'].transform(lambda x: randint(1,1000))
>>> df
    id  id_rand
0   11      833
1   22      577
2   22      577
3  333      101
4   33      723
5  333      101
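One caveat with the snippet above: randint is called independently for each group, so two different IDs can occasionally draw the same number. Below is a minimal sketch of a collision-free variant using the sample data from the question. The upper bound of 1,000,000 is an arbitrary assumption, not something from the original answer.

```python
import random

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"id": [11, 22, 22, 333, 33, 333]})

unique_ids = df["id"].unique()

# random.sample draws without replacement, so every ID gets a distinct number.
# The upper bound 1_000_000 is an assumption; it only needs to exceed the
# number of unique IDs.
masks = random.sample(range(1, 1_000_000), k=len(unique_ids))

# Map each original ID to its mask; repeated IDs share the same mask.
mapping = dict(zip(unique_ids, masks))
df["id_rand"] = df["id"].map(mapping)
```

Because the mapping is built once from the unique IDs, rows with the same ID always receive the same masked value.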
CodePudding user response:
Here are two options: generate either an enumerated categorical (non-random, id2) or a unique random number per original ID (id3). In both cases we can use pandas.factorize (or alternatively unique, or pandas.Categorical).
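import numpy as np
import pandas as pd

# enumerated categorical: IDs are numbered 0, 1, 2, ... in order of first appearance
df['id2'] = pd.factorize(df['id'])[0]

# random categorical: one unique random number per original ID
# (range(1000) must hold at least as many values as there are unique IDs)
s, ids = pd.factorize(df['id'])
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 1: build the same kind of mapping from unique()
ids = df['id'].unique()
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 2: rename the categories of a Categorical (id3 ends up with category dtype)
df['id3'] = pd.Categorical(df['id'])
new_ids = np.random.choice(range(1000), size=len(df['id3'].cat.categories), replace=False)
df['id3'] = df['id3'].cat.rename_categories(new_ids)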
Output:
    id  id2  id3
0   11    0  395
1   22    1  428
2   22    1  428
3  333    2  528
4   33    3  783
5  333    2  528
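Since the question mentions thousands of IDs, note that sampling from range(1000) with replace=False raises a ValueError once there are more than 1000 unique IDs. Here is a hedged sketch of a reusable helper that sizes the pool from the data. The function name mask_ids, the seed parameter, and the pool size of ten times the number of unique IDs are illustrative choices, not part of the answers above. It also assumes the column has no missing values.

```python
import numpy as np
import pandas as pd


def mask_ids(ids: pd.Series, seed=None) -> pd.Series:
    """Replace each distinct value with a distinct random integer.

    Hypothetical helper; assumes `ids` contains no missing values.
    """
    rng = np.random.default_rng(seed)
    codes, uniques = pd.factorize(ids)
    # Size the pool from the data so sampling without replacement cannot fail.
    pool_size = max(10 * len(uniques), 1000)
    masks = rng.choice(pool_size, size=len(uniques), replace=False)
    # codes[i] is the position of ids[i] in uniques, so indexing masks by
    # codes gives every row the mask of its original ID.
    return pd.Series(masks[codes], index=ids.index, name="id_masked")


df = pd.DataFrame({"id": [11, 22, 22, 333, 33, 333]})
df["id_masked"] = mask_ids(df["id"], seed=0)
```

Passing a fixed seed makes the mapping reproducible between runs; leave it as None to get a fresh mapping each time.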