To anonymize data, I need to replace the original IDs with a new set of IDs, while still having identical original IDs match across all fields after replacement. The challenge is doing so elegantly across 4 different presentations of IDs in this Pandas DataFrame.
I have real-world data where numerical IDs are found in 4 possible formats:
- within a list of strings: `'["38", "15", "42"]'`
- within a list of numbers: `[14, 42, 94]`
- as an integer: `42`
- as a float: `1.0`
Here is a generalized small DataFrame with all 4 data types.
import pandas as pd

df = pd.DataFrame([['["38", "15", "42"]', [14, 42, 94], 42, 1.0],
                   ['["8", "28"]', [1, 4], 8, 94.0],
                   ['["12"]', [12], 12, 12.0]],
                  columns=['CommentsID', 'AgentID', 'CaseID', 'TicketID'])
df
| CommentsID | AgentID | CaseID | TicketID |
| --------------------- | --------------- | ------- | -------- |
| ['38', '15', '42'] | [14, 42, 94] | 42 | 1.0 |
| ['8', '28'] | [1, 4] | 8 | 94.0 |
| ['12'] | [12] | 12 | 12.0 |
For ease of use in this generalized example, I'm just adding 100 to generate the list of 'new IDs'. In the actual problem, however, the corresponding new IDs are randomly generated, so simply adding 100 throughout won't solve it.
orig_ids = list(range(100))
new_ids = [x + 100 for x in orig_ids]
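In the actual problem the replacement is closer to a random draw; a hypothetical sketch (the examples below stick with the +100 version so the outputs are easy to verify):

import random
new_ids = random.sample(range(1000, 10000), len(orig_ids))  # hypothetical: random, unique new IDs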
What I would like is to find the most efficient way to replace all of the original IDs in the dataframe with the new IDs across these four data types.
Best solution I have so far is to split into three parts:
- Tackle the float and int versions with a `replace()` call (this does not affect the lists, even with `regex=True`):

df = df.replace(orig_ids, new_ids)
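Note that `replace()` also accepts a mapping, so the two parallel lists can equivalently be passed as a single dict, which carries over directly to randomly generated new IDs:

df = df.replace(dict(zip(orig_ids, new_ids)))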
- For the lists of integers, use a far-from-Pythonic doubled for-loop to match indices on the ID lists:
def newIDnumbers(datacol):
    newlist = []
    for i in range(len(datacol)):
        # position of each of this row's IDs in the orig_ids list
        numlist = [orig_ids.index(x) for x in datacol[i]]
        newlistrow = []
        for idx in range(len(numlist)):
            newlistrow.append(new_ids[numlist[idx]])
        newlist.append(newlistrow)
    return newlist
df.AgentID = newIDnumbers(df.AgentID)
df
- For the lists of strings, build string versions of the original and new ID lists, then use another far-from-Pythonic doubled for-loop to match indices on the ID lists:
str_orig_ids = [str(x) for x in orig_ids]
str_new_ids = [str(x) for x in new_ids]

def newIDstrings(datacol):
    newlist = []
    for i in range(len(datacol)):
        # pull the quoted numbers out of the stringified list, then find their positions
        numlist = [str_orig_ids.index(x) for x in datacol.str.findall(r'"(\d+)"')[i]]
        newlistrow = []
        for idx in range(len(numlist)):
            newlistrow.append(str_new_ids[numlist[idx]])
        newlist.append(newlistrow)
    return newlist
df.CommentsID = [str(x) for x in newIDstrings(df.CommentsID)]
df
Does anyone have a more elegant and less computationally heavy way to achieve this output?
df
| CommentsID | AgentID | CaseID | TicketID |
| --------------------- | --------------- | ------- | -------- |
| ['138', '115', '142'] | [114, 142, 194] | 142 | 100.0 |
| ['108', '128'] | [101, 104] | 108 | 194.0 |
| ['112'] | [112] | 112 | 112.0 |
CodePudding user response:
If adding 100 is sufficient:

import json

df['CaseID'] = df['CaseID'] + 100
df['TicketID'] = df['TicketID'] + 100
df['AgentID'] = df['AgentID'].apply(lambda x: list(map(lambda y: y + 100, x)))
df['CommentsID'] = df['CommentsID'].apply(lambda x: json.dumps(list(map(lambda y: str(int(y) + 100), json.loads(x)))))
For the `AgentID` and `CommentsID` Series, we can use `apply` to transform the values one by one. `AgentID` is the easier of the two because it is a column of lists, so we only need to map each integer in the list through the addition. For `CommentsID`, we need an additional step at the beginning to parse the stringified list into a Python list with `json.loads`, and another step at the end to convert the Python list back to a string with `json.dumps`.
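If the new IDs are randomly generated rather than a fixed offset, the same `apply` pattern works with a lookup dict; a minimal sketch, assuming `id_map` maps each original ID to its replacement:

import json

id_map = dict(zip(orig_ids, new_ids))  # hypothetical mapping; the values can be anything

df['CaseID'] = df['CaseID'].map(id_map)
df['TicketID'] = df['TicketID'].map(lambda v: float(id_map[int(v)]))  # go through int so 1.0 keys on 1
df['AgentID'] = df['AgentID'].apply(lambda xs: [id_map[x] for x in xs])
df['CommentsID'] = df['CommentsID'].apply(
    lambda s: json.dumps([str(id_map[int(y)]) for y in json.loads(s)]))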
CodePudding user response:
Here is one way, using reshaping and `factorize` to anonymize the IDs:

from ast import literal_eval

def df_factorize(df):
    idx = df.index
    s = df.reset_index(drop=True).stack()  # stack to ensure consistent
    s[:] = s.factorize()[0]                # factors among all columns
    return s.unstack().set_index(idx)
df2 = (df
       .assign(CommentsID=df['CommentsID'].apply(literal_eval))   # parse strings to lists
       .explode(['CommentsID', 'AgentID'])                        # lists to rows
       .assign(CommentsID=lambda d: d['CommentsID'].astype(int))  # extract as integers
       .pipe(df_factorize)                                        # anonymize
       .groupby(level=0)                                          # below: reshape back
       .agg({'CommentsID': lambda s: str(list(s.astype(str))),    # to original form
             'AgentID': list,
             'CaseID': 'first',
             'TicketID': 'first',
             })
       .astype(df.dtypes)                                         # original dtypes
       )
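(Exploding several columns at once requires pandas 1.3 or later, and assumes the two list columns have matching element counts within each row.)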
output:
CommentsID AgentID CaseID TicketID
0 ['0', '4', '2'] [1, 2, 5] 2 3.0
1 ['6', '7'] [3, 8] 6 5.0
2 ['9'] [9] 9 9.0
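The same reshaping trick works when you already have a specific old-to-new mapping rather than factorized codes; a sketch, assuming an `id_map` dict built as `dict(zip(orig_ids, new_ids))` and used as `.pipe(df_map_ids, id_map)` in place of `.pipe(df_factorize)` above:

def df_map_ids(df, id_map):
    idx = df.index
    s = df.reset_index(drop=True).stack()
    s[:] = s.map(lambda x: id_map[int(x)])  # int() folds 1.0 and 1 onto the same key
    return s.unstack().set_index(idx)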
In case you want "real" anonymization, you could use a UUID instead. Swap `df_uuid` for `df_factorize` in the pipeline above and drop the final `.astype(df.dtypes)`, since the UUIDs are strings:
from uuid import uuid4

def df_uuid(df):
    idx = df.index
    s = df.reset_index(drop=True).stack()
    f = s.unique()
    s = s.map(dict(zip(f, [str(uuid4()) for _ in range(len(f))])))
    return s.unstack().set_index(idx)
output:
CommentsID AgentID CaseID TicketID
0 ['e9fa80ed-58e2-4a96-a8da-0bc0c3a87e14', 'c87ca55a-e0a4-4618-ab5a-2eaf68d5a70f', '7cba72a8-3f2a-42e7-a781-b8ed242ea2ac'] [65f95f56-877e-4a9f-be88-4302fe1f77ea, 7cba72a8-3f2a-42e7-a781-b8ed242ea2ac, 3cf48cd7-e1b0-4c7f-bbeb-41836e309511] 7cba72a8-3f2a-42e7-a781-b8ed242ea2ac 549f1768-39b5-4e74-a8dd-4edbf3bb591f
1 ['6e54d7d6-ae2f-4319-950e-7629128d518a', '8fef2e1c-99d1-4460-b8a4-f4dbbf213c5f'] [549f1768-39b5-4e74-a8dd-4edbf3bb591f, 98f82da0-fa9c-447a-943b-b11875736128] 6e54d7d6-ae2f-4319-950e-7629128d518a 3cf48cd7-e1b0-4c7f-bbeb-41836e309511
2 ['3396c9db-1244-4b2d-bb76-14bb0221778f'] [3396c9db-1244-4b2d-bb76-14bb0221778f] 3396c9db-1244-4b2d-bb76-14bb0221778f 3396c9db-1244-4b2d-bb76-14bb0221778f