I've recently started to play around with Pandas in order to manipulate some data and I am now trying to anonymize a few columns after a groupBy
to find unique occurrences for persons.
For example, suppose the following DF:
First Name Last Name DOB
0 Bob One 28/05/1973
1 Bob One 28/05/1973
2 Ana Two 28/07/1991
3 Ana Two 28/07/1991
4 Ana Two 28/07/1991
5 Jim Three 07/01/1994
I can easily find unique person by First Name, Last Name and DOB by using df.groupby(['First Name', 'Last Name', 'DOB'])
.
However, I'd like to apply a function to every unique combination that would transform those names to a known anonymized (incremental) version.
First Name Last Name DOB
0 F1 L1 28/05/1973
1 F1 L1 28/05/1973
2 F2 L2 28/07/1991
3 F2 L2 28/07/1991
4 F2 L2 28/07/1991
5 F3 L3 07/01/1994
I've tried a few things with transform
and apply
functions of DF groupBy but with no lucky so far. How could I achieve this?
CodePudding user response:
ids = (df.groupby(["FirstName", "LastName", "DOB"], sort=False)
.ngroup().add(1)
.astype(str))
df["FirstName"] = "F" ids
df["LastName"] = "L" ids
identify the IDs of firstname, lastname and DOB triples over their group number (ngroup)
- sort=False helps it retain the seen order
- ngroup is 0-based, so we add(1)
then add prefixes "F" and "L" to the IDs and assign to appropriate columns
to get
>>> df
FirstName LastName DOB
0 F1 L1 28/05/1973
1 F1 L1 28/05/1973
2 F2 L2 28/07/1991
3 F2 L2 28/07/1991
4 F2 L2 28/07/1991
5 F3 L3 07/01/1994
where the ids
was
>>> ids
0 1
1 1
2 2
3 2
4 2
5 3
dtype: object
CodePudding user response:
You can use pd.factorize
:
>>> df.assign(**{'First Name': pd.factorize(df['First Name'])[0],
'Last Name': pd.factorize(df['Last Name'])[0]})
First Name Last Name DOB
0 0 0 28/05/1973
1 0 0 28/05/1973
2 1 1 28/07/1991
3 1 1 28/07/1991
4 1 1 28/07/1991
5 2 2 07/01/1994