Home > Software design >  Replacing numerical IDs contained as lists of strings and lists of ints in a Pandas DataFrame
Replacing numerical IDs contained as lists of strings and lists of ints in a Pandas DataFrame

Time:02-08

In order to make data anonymous, I need to replace original IDs with a different set of new IDs, but still have identical original IDs match across all fields after being replaced. The challenge is doing so elegantly across 4 different presentations of IDs in this Pandas DataFrame.

I have real-world data where numerical IDs are found in 4 possible formats:

  1. Within a list of strings '["38", "15", "42"]'
  2. Within a list of numbers [14, 42, 94]
  3. As integers 42
  4. As floats 1.0

Here is a generalized small DataFrame with all 4 data types.

df = pd.DataFrame([['["38", "15", "42"]', [14, 42, 94], 42, 1.0],\
 ['["8", "28"]', [1, 4], 8, 94.0], ['["12"]', [12], 12, 12.0]],\
 columns = ['CommentsID','AgentID','CaseID','TicketID'])

df
| CommentsID            | AgentID         | CaseID  | TicketID |
| --------------------- | --------------- | ------- | -------- |
| ['38', '15', '42']    | [14, 42, 94]    | 42      | 1.0      |
| ['8', '28']           | [1, 4]          | 8       | 94.0     |
| ['12']                | [12]            | 12      | 12.0     |

For ease of use in a generalized example, I'm just adding 100 to generate a list of 'new IDs'. However, in the actual problem, the list of corresponding new IDs are randomly generated, so no adding 100 throughout to solve for this one.

orig_ids = list(range(100))
new_ids = [x   100 for x in orig_ids]

What I would like is to find the most efficient way to replace all of the original IDs in the dataframe with the new IDs across these four data types.

Best solution I have so far is to split into three parts:

  1. Tackle the float and int versions with a replace() function (this does not affect the lists, even with regex=True):
df = df.replace(orig_ids, new_ids)
  1. For the lists of integers, use a far-from-Pythonic doubled for-loop to match indices on the ID lists:
def newIDnumbers(datacol):
    newlist = []
    for i in range(len(datacol)):
        numlist = [orig_ids.index(x) for x in df.AgentID[i]]
        newlistrow = []
        for idx in range(len(numlist)):
            newlistrow.append(new_ids[numlist[idx]])
        newlist.append(newlistrow)
    return newlist

df.AgentID = newIDnumbers(df.AgentID)                     
df
  1. For the lists of strings, build string lists of original ids and new ids, then use a far-from-Pythonic doubled for-loop to match indices on the ID lists:
str_orig_ids = [str(x) for x in orig_ids]
str_new_ids = [str(x) for x in new_ids]

def newIDstrings(datacol):
    newlist = []
    for i in range(len(datacol)):
        numlist = [str_orig_ids.index(x) for x in datacol.str.findall(r'"(\d*)"')[i]]
        newlistrow = []
        for idx in range(len(numlist)):
            newlistrow.append(str_new_ids[numlist[idx]])
        newlist.append(newlistrow)
    return newlist

df.CommentsID = [str(x) for x in newIDstrings(df.CommentsID)]
df       

Does anyone have a more elegant and less computationally-heavy way to achieve this output?

df
| CommentsID            | AgentID         | CaseID  | TicketID |
| --------------------- | --------------- | ------- | -------- |
| ['138', '115', '142'] | [114, 142, 194] | 142     | 100.0    |
| ['108', '128']        | [101, 104]      | 108     | 194.0    |
| ['112']               | [112]           | 112     | 112.0    |

CodePudding user response:

If adding 100 is sufficient,

import json
df['CaseID'] = df['CaseID']   100
df['TicketID'] = df['TicketID']   100
df['AgentID'] = df['AgentID'].apply(lambda x: list(map(lambda y: y 100, x)))
df['CommentsID'] = df['CommentsID'].apply(lambda x: json.dumps(list(map(lambda y: str(int(y) 100), json.loads(x)))))

For AgentID and CommentsID Series, we can use apply to apply transformation one by one.

For AgentID, this is easier because it is a column of list, so we only need to map each integer in the list to the addition.

For CommentsID, we need an additional step at the beginning to convert the list in str into a python list by json.loads, and another additional step at the end to convert the python list back to str using json.dumps.

CodePudding user response:

Here is one way using reshaping and factorize to anonymize the IDs:

from ast import literal_eval

def df_factorize(df):
    idx = df.index
    s = df.reset_index(drop=True).stack()  # stack to ensure consistent
    s[:] = s.factorize()[0]                # factors among all columns
    return s.unstack().set_index(idx)

df2 = (df
   .assign(CommentsID=df['CommentsID'].apply(literal_eval))   # extract as integers
   .explode(['CommentsID', 'AgentID'])                        # lists to rows
   .assign(CommentsID=lambda d: d['CommentsID'].astype(int))
   .pipe(df_factorize)                                        # anonymize
   .groupby(level=0)                                          # below to reshape
   .agg({'CommentsID': lambda s: str(list(s.astype(str))),    # to original form
         'AgentID': list,
         'CaseID': 'first',
         'TicketID': 'first',
        })
   .astype(df.dtypes)                                         # original dtypes
)

output:

        CommentsID    AgentID  CaseID  TicketID
0  ['0', '4', '2']  [1, 2, 5]       2       3.0
1       ['6', '7']     [3, 8]       6       5.0
2            ['9']        [9]       9       9.0

In case you want "real" anonymization, you could use a uuid:

from uuid import uuid4

def df_uuid(df):
    idx = df.index
    s = df.reset_index(drop=True).stack()
    f = s.unique()
    s = s.map(dict(zip(f, [str(uuid4()) for _ in range(len(f))])))
    return s.unstack().set_index(idx)

output:

                                                                                                                 CommentsID                                                                                                             AgentID                                CaseID                              TicketID
0  ['e9fa80ed-58e2-4a96-a8da-0bc0c3a87e14', 'c87ca55a-e0a4-4618-ab5a-2eaf68d5a70f', '7cba72a8-3f2a-42e7-a781-b8ed242ea2ac']  [65f95f56-877e-4a9f-be88-4302fe1f77ea, 7cba72a8-3f2a-42e7-a781-b8ed242ea2ac, 3cf48cd7-e1b0-4c7f-bbeb-41836e309511]  7cba72a8-3f2a-42e7-a781-b8ed242ea2ac  549f1768-39b5-4e74-a8dd-4edbf3bb591f
1                                          ['6e54d7d6-ae2f-4319-950e-7629128d518a', '8fef2e1c-99d1-4460-b8a4-f4dbbf213c5f']                                        [549f1768-39b5-4e74-a8dd-4edbf3bb591f, 98f82da0-fa9c-447a-943b-b11875736128]  6e54d7d6-ae2f-4319-950e-7629128d518a  3cf48cd7-e1b0-4c7f-bbeb-41836e309511
2                                                                                  ['3396c9db-1244-4b2d-bb76-14bb0221778f']                                                                              [3396c9db-1244-4b2d-bb76-14bb0221778f]  3396c9db-1244-4b2d-bb76-14bb0221778f  3396c9db-1244-4b2d-bb76-14bb0221778f
  •  Tags:  
  • Related