Home > Blockchain >  Panda rename rows after grouping by columns
Panda rename rows after grouping by columns

Time:01-07

I've recently started to play around with Pandas in order to manipulate some data and I am now trying to anonymize a few columns after a groupBy to find unique occurrences for persons.

For example, suppose the following DF:

   First Name Last Name         DOB
0  Bob        One               28/05/1973
1  Bob        One               28/05/1973
2  Ana        Two               28/07/1991
3  Ana        Two               28/07/1991
4  Ana        Two               28/07/1991
5  Jim        Three             07/01/1994

I can easily find unique person by First Name, Last Name and DOB by using df.groupby(['First Name', 'Last Name', 'DOB']).

However, I'd like to apply a function to every unique combination that would transform those names to a known anonymized (incremental) version.

   First Name Last Name         DOB
0  F1         L1                28/05/1973
1  F1         L1                28/05/1973
2  F2         L2                28/07/1991
3  F2         L2                28/07/1991
4  F2         L2                28/07/1991
5  F3         L3                07/01/1994

I've tried a few things with transform and apply functions of DF groupBy but with no lucky so far. How could I achieve this?

CodePudding user response:

ids = (df.groupby(["FirstName", "LastName", "DOB"], sort=False)
         .ngroup().add(1)
         .astype(str))

df["FirstName"] = "F"   ids
df["LastName"]  = "L"   ids
  • identify the IDs of firstname, lastname and DOB triples over their group number (ngroup)

    • sort=False helps it retain the seen order
    • ngroup is 0-based, so we add(1)
  • then add prefixes "F" and "L" to the IDs and assign to appropriate columns

to get

>>> df
  FirstName LastName         DOB
0        F1       L1  28/05/1973
1        F1       L1  28/05/1973
2        F2       L2  28/07/1991
3        F2       L2  28/07/1991
4        F2       L2  28/07/1991
5        F3       L3  07/01/1994

where the ids was

>>> ids
0    1
1    1
2    2
3    2
4    2
5    3
dtype: object

CodePudding user response:

You can use pd.factorize:

>>> df.assign(**{'First Name': pd.factorize(df['First Name'])[0],
                 'Last Name': pd.factorize(df['Last Name'])[0]})

   First Name  Last Name         DOB
0           0          0  28/05/1973
1           0          0  28/05/1973
2           1          1  28/07/1991
3           1          1  28/07/1991
4           1          1  28/07/1991
5           2          2  07/01/1994
  • Related